Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 12:54:11 +08:00)
Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Committed by: Facebook GitHub Bot
Parent: cab2689eb1
Commit: 8c798e0622
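For local cleanup outside of CI, the command from the summary can be wrapped so that it also works where GNU sed is not the default `sed`. This is a minimal sketch, not part of the PR: only the `git grep`/`gsed` pipeline comes from the commit message above, and the GNU/BSD sed detection around it is an assumption.

```bash
#!/usr/bin/env bash
# Minimal local sketch (not part of the PR): strip trailing spaces from
# tracked text files, mirroring the command given in the summary above.

if sed --version >/dev/null 2>&1; then
  SED=(sed -i)        # GNU sed (typical on Linux)
elif command -v gsed >/dev/null 2>&1; then
  SED=(gsed -i)       # GNU sed from Homebrew on macOS, as used in the summary
else
  SED=(sed -i '')     # BSD sed needs an explicit (empty) backup suffix
fi

# -I skips binary files; the pathspecs exclude vendored and third-party code.
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' \
  | xargs "${SED[@]}" 's/ *$//'
```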
.github/workflows/lint.yml (vendored, 3 added lines)

@@ -40,6 +40,9 @@ jobs:
         rm -r "shellcheck-${scversion}"
         shellcheck --version
         .jenkins/run-shellcheck.sh
+    - name: Ensure no trailing spaces
+      run: |
+        (! git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' || (echo "The above files have trailing spaces; please remove them"; false))
     - name: Ensure no tabs
       run: |
         (! git grep -I -l $'\t' -- . ':(exclude)*.svg' ':(exclude)**Makefile' ':(exclude)**/contrib/**' ':(exclude)third_party' ':(exclude).gitattributes' ':(exclude).gitmodules' || (echo "The above files have tabs; please convert them to spaces"; false))
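The two checks above can also be run locally from the repository root before pushing. In this sketch the grep invocations are copied verbatim from the workflow; the wrapper script around them is illustrative and not part of the PR.

```bash
#!/usr/bin/env bash
# Sketch of running the new lint checks locally; the grep invocations are
# copied from the workflow above, the wrapper script itself is illustrative.

# Fail if any tracked text file still has trailing spaces.
(! git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' \
  || (echo "The above files have trailing spaces; please remove them"; false))

# Fail if any tracked text file contains tabs outside the allowed paths.
(! git grep -I -l $'\t' -- . ':(exclude)*.svg' ':(exclude)**Makefile' \
  ':(exclude)**/contrib/**' ':(exclude)third_party' \
  ':(exclude).gitattributes' ':(exclude).gitmodules' \
  || (echo "The above files have tabs; please convert them to spaces"; false))
```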
Every other hunk in this commit is a trailing-whitespace-only removal, generated by the command shown in the summary, across build and CI shell scripts, documentation (including GLOSSARY.md, CONTRIBUTING troubleshooting notes, and the op_registration README), ATen/c10 C++ headers and sources, CPU and CUDA native kernels, and QNNPACK assembly; the changed lines differ from their originals only by the deleted trailing spaces.
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
# |----------------|
|
||||
# |r4 - r11 | 0
|
||||
# |r4 - r11 | 0
|
||||
# |packed_a | 32
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# Packed A format.
|
||||
# 4kx4m blocks for alls blocks given 4 rows (4m) are placed in contiguous memory.
|
||||
@ -42,7 +42,7 @@
|
||||
# | | Thus Packed A has (K + 4 - 1)/4 * (M + 4 -1)/4 blocks
|
||||
# | |
|
||||
# |---------------------|
|
||||
#
|
||||
#
|
||||
# Each 4 x 4 blocks is transposed and stored.
|
||||
# Each of the (K + 4 - 1)/4 blocks for a given group of 4 m blocks
|
||||
# are stored adjacent in memory
|
||||
|
@ -20,7 +20,7 @@
|
||||
## Stack
|
||||
# 4 a_stride
|
||||
# 4 packed_w
|
||||
# 4 w_row_ptr
|
||||
# 4 w_row_ptr
|
||||
# 4 w_block_ids_ptr
|
||||
# 4 b
|
||||
# 4 c
|
||||
@ -43,7 +43,7 @@
|
||||
# |out ch indx | 24
|
||||
# |params | 28
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
@ -58,7 +58,7 @@
|
||||
# |out ch indx | 120
|
||||
# |params | 124
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon(
|
||||
# size_t mr,
|
||||
@ -223,7 +223,7 @@ k_loop:
|
||||
# Each iteration produce 4 values each of 4 bytes
|
||||
# Thus 4 x 4 = 16 bytes 2^4
|
||||
# In this implementation, first value will be stored at
|
||||
# 1st value: sp - 12 - r1 * 16
|
||||
# 1st value: sp - 12 - r1 * 16
|
||||
# 2nd value: sp - 12 - (r1 - 1) * 16
|
||||
# and so on.
|
||||
SUB r9, r9, r1, LSL #4
|
||||
|
@ -20,7 +20,7 @@
|
||||
## Stack
|
||||
# 4 a_stride
|
||||
# 4 packed_w
|
||||
# 4 w_row_ptr
|
||||
# 4 w_row_ptr
|
||||
# 4 w_block_ids_ptr
|
||||
# 4 b
|
||||
# 4 c
|
||||
@ -43,7 +43,7 @@
|
||||
# |out ch indx | 24
|
||||
# |params | 28
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
@ -58,7 +58,7 @@
|
||||
# |out ch indx | 120
|
||||
# |params | 124
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon(
|
||||
# size_t mr,
|
||||
|
@ -21,15 +21,15 @@
|
||||
# |----------------|
|
||||
# |packed_a | 0
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
# |----------------|
|
||||
# |r4 - r11 | 0
|
||||
# |r4 - r11 | 0
|
||||
# |packed_a | 32
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# Packed A format.
|
||||
# 8kx4m blocks for alls blocks given 4 rows (4m) are placed in contiguous memory.
|
||||
@ -42,7 +42,7 @@
|
||||
# | | Thus Packed A has (K + 4 - 1)/4 * (M + 8 -1)/8 blocks
|
||||
# | |
|
||||
# |---------------------|
|
||||
#
|
||||
#
|
||||
# Each 8 x 4 blocks is transposed and stored.
|
||||
# Each of the (K + 4 - 1)/4 blocks for a given group of 8 m blocks
|
||||
# are stored adjacent in memory
|
||||
|
@ -19,7 +19,7 @@
|
||||
# | | Thus Packed A has (K + 4 - 1)/4 * (M + 8 -1)/8 blocks
|
||||
# | |
|
||||
# |---------------------|
|
||||
#
|
||||
#
|
||||
# Each 8 x 4 blocks is transposed and stored.
|
||||
# Each of the (K + 4 - 1)/4 blocks for a given group of 8 m blocks
|
||||
# are stored adjacent in memory
|
||||
|
@ -278,7 +278,7 @@ k_loop:
|
||||
# v10 : x10, x11, x12, x13
|
||||
# v12 : x20, x21, x22, x23
|
||||
# v14 : x30, x31, x32, x33
|
||||
# Then using
|
||||
# Then using
|
||||
# TRANSPOSE_4X4_S32 v16, v18, v20, v22, v4, v5, v6, v7
|
||||
# We get
|
||||
# v16 : x04, x05, x06, x07
|
||||
|
@ -635,7 +635,7 @@ Tensor _sparse_log_softmax(const Tensor& input_, const int64_t dim_, c10::option
|
||||
namedinference::propagate_names(result, input_);
|
||||
return result;
|
||||
}
|
||||
|
||||
|
||||
Tensor _sparse_log_softmax(const Tensor& self, Dimname dim, optional<ScalarType> dtype) {
|
||||
return at::_sparse_log_softmax(self, dimname_to_position(self, dim), dtype);
|
||||
}
|
||||
|
@ -541,8 +541,8 @@ Tensor sparse_mask_helper_cpu(
|
||||
`t` - coalesced sparse tensor input
|
||||
`mask_indices` - mask indices tensor
|
||||
|
||||
Note: The nnz in the output tensor will be same as the `mask_indices`. So it will
|
||||
works independently if the mask is coalesced or not.
|
||||
Note: The nnz in the output tensor will be same as the `mask_indices`. So it will
|
||||
works independently if the mask is coalesced or not.
|
||||
*/
|
||||
TORCH_CHECK(t.is_sparse(), "t: input is not a sparse tensor");
|
||||
TORCH_CHECK(t.is_coalesced(), "t: input is uncoalesced");
|
||||
@ -554,7 +554,7 @@ Tensor sparse_mask_helper_cpu(
|
||||
auto t_v = t._values();
|
||||
auto vsize = t_v.sizes().vec();
|
||||
vsize[0] = r_nnz;
|
||||
|
||||
|
||||
Tensor r_values = at::zeros(vsize, t_v.options());
|
||||
auto t_i = t._indices();
|
||||
auto t_nnz = t._nnz();
|
||||
@ -583,7 +583,7 @@ Tensor sparse_mask_helper_cpu(
|
||||
}
|
||||
}
|
||||
});
|
||||
return r_values;
|
||||
return r_values;
|
||||
}
|
||||
|
||||
}} // namespace at::native
|
||||
|
@ -1116,7 +1116,7 @@ SparseTensor& _sspaddmm_out_cpu(
|
||||
"sspaddmm: Argument #1: Expected dim 1 size ", dim_k, ", got ", t.size(1));
|
||||
|
||||
int64_t nnz = sparse._nnz();
|
||||
// We have to make indices contiguous as we use indices.data_ptr in _to_csr which assumes row-contiguous storage
|
||||
// We have to make indices contiguous as we use indices.data_ptr in _to_csr which assumes row-contiguous storage
|
||||
Tensor indices = sparse._indices().contiguous();
|
||||
Tensor values = sparse._values();
|
||||
|
||||
|
@ -144,8 +144,8 @@ void csrmm2(
|
||||
TORCH_CUDASPARSE_CHECK(cusparseCreateDnMat(
|
||||
&descC, /* output */
|
||||
m, n, ldc, /* rows, cols, leading dimension */
|
||||
c, /* values */
|
||||
cusparse_value_type, /* data type of values */
|
||||
c, /* values */
|
||||
cusparse_value_type, /* data type of values */
|
||||
CUSPARSE_ORDER_COL /* memory layout, ONLY column-major is supported now */
|
||||
));
|
||||
|
||||
|
@ -105,7 +105,7 @@ def main(argv):
|
||||
out_dir = pathlib.Path(__file__).parent
|
||||
|
||||
(out_dir / "nnapi_wrapper.h").write_text(
|
||||
PREFIX +
|
||||
PREFIX +
|
||||
textwrap.dedent("""\
|
||||
#ifndef NNAPI_WRAPPER_H_
|
||||
#define NNAPI_WRAPPER_H_
|
||||
@ -124,7 +124,7 @@ def main(argv):
|
||||
)
|
||||
|
||||
(out_dir / "nnapi_wrapper.cpp").write_text(
|
||||
PREFIX +
|
||||
PREFIX +
|
||||
textwrap.dedent("""\
|
||||
#ifndef _WIN32
|
||||
#include <dlfcn.h>
|
||||
|
@ -140,7 +140,7 @@ struct NnapiCompilation : torch::jit::CustomClassHolder {
|
||||
}
|
||||
|
||||
check_nnapi->Execution_compute(execution);
|
||||
|
||||
|
||||
// TODO: Maybe skip this for fixed-size outputs?
|
||||
for (size_t i = 0; i < outputs.size(); i++) {
|
||||
auto& t = outputs[i];
|
||||
|
@ -18,7 +18,7 @@ TEST(CPUGeneratorImpl, TestGeneratorDynamicCast) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestDefaultGenerator) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check if default generator is created only once
|
||||
// address of generator should be same in all calls
|
||||
auto foo = at::detail::getDefaultCPUGenerator();
|
||||
@ -27,7 +27,7 @@ TEST(CPUGeneratorImpl, TestDefaultGenerator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestCloning) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check cloning of new generators.
|
||||
// Note that we don't allow cloning of other
|
||||
// generator states into default generators.
|
||||
@ -47,9 +47,9 @@ void thread_func_get_engine_op(CPUGeneratorImpl* generator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestMultithreadingGetEngineOperator) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check CPUGeneratorImpl is reentrant and the engine state
|
||||
// is not corrupted when multiple threads request for
|
||||
// is not corrupted when multiple threads request for
|
||||
// random samples.
|
||||
// See Note [Acquire lock when using random generators]
|
||||
auto gen1 = at::detail::createCPUGenerator();
|
||||
@ -74,7 +74,7 @@ TEST(CPUGeneratorImpl, TestMultithreadingGetEngineOperator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestGetSetCurrentSeed) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test current seed getter and setter
|
||||
// See Note [Acquire lock when using random generators]
|
||||
auto foo = at::detail::getDefaultCPUGenerator();
|
||||
@ -92,7 +92,7 @@ void thread_func_get_set_current_seed(Generator generator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test current seed getter and setter are thread safe
|
||||
// See Note [Acquire lock when using random generators]
|
||||
auto gen1 = at::detail::getDefaultCPUGenerator();
|
||||
@ -107,7 +107,7 @@ TEST(CPUGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestRNGForking) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test that state of a generator can be frozen and
|
||||
// restored
|
||||
// See Note [Acquire lock when using random generators]
|
||||
@ -124,7 +124,7 @@ TEST(CPUGeneratorImpl, TestRNGForking) {
|
||||
ASSERT_EQ(target_value.sum().item<double>(), forked_value.sum().item<double>());
|
||||
}
|
||||
|
||||
/**
|
||||
/**
|
||||
* Philox CPU Engine Tests
|
||||
*/
|
||||
|
||||
@ -208,7 +208,7 @@ TEST(CPUGeneratorImpl, TestMT19937EngineReproducibility) {
|
||||
// Test Description:
|
||||
// Tests if same inputs give same results when compared
|
||||
// to std.
|
||||
|
||||
|
||||
// test with zero seed
|
||||
at::mt19937 engine1(0);
|
||||
std::mt19937 engine2(0);
|
||||
@ -231,5 +231,5 @@ TEST(CPUGeneratorImpl, TestMT19937EngineReproducibility) {
|
||||
for(int i = 0; i < 10000; i++) {
|
||||
ASSERT_EQ(engine1(), engine2());
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
@ -80,7 +80,7 @@ __global__ void testEngineOffset2(){
|
||||
unsigned long long increment_val = ::ldexp(1.0, 64);
|
||||
at::Philox4_32_10 engine1(123, 0, increment_val);
|
||||
at::Philox4_32_10 engine2(123, increment_val, increment_val);
|
||||
|
||||
|
||||
engine2.incr_n(increment_val);
|
||||
engine2.incr();
|
||||
assert(engine1() == engine2());
|
||||
@ -166,7 +166,7 @@ TEST(CUDAGeneratorImpl, TestGeneratorDynamicCast) {
|
||||
}
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestDefaultGenerator) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check if default generator state is created only once
|
||||
// address of generator should be same in all calls
|
||||
if (!at::cuda::is_available()) return;
|
||||
@ -186,7 +186,7 @@ TEST(CUDAGeneratorImpl, TestDefaultGenerator) {
|
||||
}
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestCloning) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check cloning of new generators.
|
||||
// Note that we don't allow cloning of other
|
||||
// generator states into default generators.
|
||||
@ -211,9 +211,9 @@ void thread_func_get_set_current_seed(Generator generator) {
|
||||
current_seed++;
|
||||
generator.set_current_seed(current_seed);
|
||||
}
|
||||
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test current seed getter and setter are thread safe
|
||||
// See Note [Acquire lock when using random generators]
|
||||
if (!at::cuda::is_available()) return;
|
||||
@ -229,7 +229,7 @@ TEST(CUDAGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
}
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestRNGForking) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test that state of a generator can be frozen and
|
||||
// restored
|
||||
// See Note [Acquire lock when using random generators]
|
||||
|
@ -765,7 +765,7 @@ int main()
|
||||
|
||||
test_THDoubleVector_fill_VSX();
|
||||
test_THFloatVector_fill_VSX();
|
||||
|
||||
|
||||
test_THDoubleVector_muls_VSX();
|
||||
test_THFloatVector_muls_VSX();
|
||||
|
||||
|
@ -93,7 +93,7 @@ static inline uint32_t detectHostSIMDExtensions()
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#elif defined(__EMSCRIPTEN__)
|
||||
|
||||
static inline uint32_t detectHostSIMDExtensions()
|
||||
|
@ -294,9 +294,9 @@ inline __device__ at::BFloat16 gpuAtomicMul(at::BFloat16 * address, at::BFloat16
|
||||
return AtomicFPOp<at::BFloat16>()(address, val,
|
||||
[](at::BFloat16 bsum, at::BFloat16 val) {
|
||||
return THCNumerics<at::BFloat16>::mul(bsum, val);
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
|
||||
inline __device__ double gpuAtomicMul(double * address, double val) {
|
||||
return AtomicFPOp<double>()(address, val,
|
||||
[](double val, unsigned long long int assumed) {
|
||||
|
@ -270,7 +270,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeSplitTail)(benchmark::State& state) {
|
||||
te::For* tail;
|
||||
loop.splitWithTail(m, kChunkSize, &mo, &mi, &tail);
|
||||
}
|
||||
|
||||
|
||||
loop.prepareForCodegen();
|
||||
te::Stmt* s = loop.root_stmt();
|
||||
s = te::IRSimplifier::simplify(s);
|
||||
@ -313,7 +313,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeSplitMask)(benchmark::State& state) {
|
||||
te::For* mi;
|
||||
loop.splitWithMask(m, kChunkSize, &mo, &mi);
|
||||
}
|
||||
|
||||
|
||||
loop.prepareForCodegen();
|
||||
te::Stmt* s = loop.root_stmt();
|
||||
s = te::IRSimplifier::simplify(s);
|
||||
@ -369,7 +369,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeRfactorV1)(benchmark::State& state) {
|
||||
auto bt_body = te::NodeFinder<te::ReduceOp>::find(loop.root_stmt())[0];
|
||||
loop.rfactor(bt_body, mi->var());
|
||||
}
|
||||
|
||||
|
||||
loop.prepareForCodegen();
|
||||
te::Stmt* s = loop.root_stmt();
|
||||
s = te::IRSimplifier::simplify(s);
|
||||
@ -419,7 +419,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeRfactorV2)(benchmark::State& state) {
|
||||
|
||||
{
|
||||
// Look for the new For and vectorize, but rfactor didn't return the newly added "For *".
|
||||
// Resort to a hack to find the lost "For *".
|
||||
// Resort to a hack to find the lost "For *".
|
||||
// TODO: make it easier to find the transformed loop after rfactor.
|
||||
auto loops = te::NodeFinder<te::For>::find(loop.root_stmt());
|
||||
TORCH_CHECK(loops.size() == 4);
|
||||
|
@ -163,7 +163,7 @@ class AgentBase:
|
||||
r"""
|
||||
Finishes the episode
|
||||
Args:
|
||||
rets (list): List containing rewards generated by selct action calls during
|
||||
rets (list): List containing rewards generated by selct action calls during
|
||||
episode run
|
||||
"""
|
||||
return self.agent_latency, self.agent_throughput
|
||||
|
@ -48,13 +48,13 @@ class CoordinatorBase:
|
||||
|
||||
def run_coordinator(self, episodes, episode_steps, queue):
|
||||
r"""
|
||||
Runs n benchmark episodes. Each episode is started by coordinator telling each
|
||||
observer to contact the agent. Each episode is concluded by coordinator telling agent
|
||||
Runs n benchmark episodes. Each episode is started by coordinator telling each
|
||||
observer to contact the agent. Each episode is concluded by coordinator telling agent
|
||||
to finish the episode, and then the coordinator records benchmark data
|
||||
Args:
|
||||
episodes (int): Number of episodes to run
|
||||
episode_steps (int): Number steps to be run in each episdoe by each observer
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
saving benchmark run results to
|
||||
"""
|
||||
|
||||
@ -96,9 +96,9 @@ class CoordinatorBase:
|
||||
observer_throughput_final = [
|
||||
t for s in observer_throughput_final for t in s]
|
||||
|
||||
benchmark_metrics = {'agent latency (seconds)': {},
|
||||
'agent throughput': {},
|
||||
'observer latency (seconds)': {},
|
||||
benchmark_metrics = {'agent latency (seconds)': {},
|
||||
'agent throughput': {},
|
||||
'observer latency (seconds)': {},
|
||||
'observer throughput': {}}
|
||||
|
||||
|
||||
|
@ -44,19 +44,19 @@ args = vars(args)
|
||||
|
||||
def run_worker(rank, world_size, master_addr, master_port, batch, state_size, nlayers, out_features, queue):
|
||||
r"""
|
||||
inits an rpc worker
|
||||
inits an rpc worker
|
||||
Args:
|
||||
rank (int): Rpc rank of worker machine
|
||||
world_size (int): Number of workers in rpc network (number of observers +
|
||||
1 agent + 1 coordinator)
|
||||
master_addr (str): Master address of cooridator
|
||||
master_port (str): Master port of coordinator
|
||||
batch (bool): Whether agent will use batching or process one observer
|
||||
batch (bool): Whether agent will use batching or process one observer
|
||||
request a at a time
|
||||
state_size (str): Numerical str representing state dimensions (ie: 5-15-10)
|
||||
nlayers (int): Number of layers in model
|
||||
out_features (int): Number of out features in model
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
saving benchmark run results to
|
||||
"""
|
||||
state_size = list(map(int, state_size.split('-')))
|
||||
@ -82,9 +82,9 @@ def find_graph_variable(args):
|
||||
r"""
|
||||
Determines if user specified multiple entries for a single argument, in which case
|
||||
benchmark is run for each of these entries. Comma separated values in a given argument indicate multiple entries.
|
||||
Output is presented so that user can use plot repo to plot the results with each of the
|
||||
variable argument's entries on the x-axis. Args is modified in accordance with this.
|
||||
More than 1 argument with multiple entries is not permitted.
|
||||
Output is presented so that user can use plot repo to plot the results with each of the
|
||||
variable argument's entries on the x-axis. Args is modified in accordance with this.
|
||||
More than 1 argument with multiple entries is not permitted.
|
||||
Args:
|
||||
args (dict): Dictionary containing arguments passed by the user (and default arguments)
|
||||
"""
|
||||
@ -138,12 +138,12 @@ def print_benchmark_results(report):
|
||||
if x_axis_name:
|
||||
x_axis_output_label = f'{x_axis_name} |'
|
||||
heading += append_spaces(x_axis_output_label, col_width)
|
||||
metric_headers = ['agent latency (seconds)', 'agent throughput',
|
||||
metric_headers = ['agent latency (seconds)', 'agent throughput',
|
||||
'observer latency (seconds)', 'observer throughput']
|
||||
percentile_subheaders = ['p50', 'p75', 'p90', 'p95']
|
||||
subheading = ""
|
||||
if x_axis_name:
|
||||
subheading += append_spaces(' ' * (len(x_axis_output_label) - 1), col_width)
|
||||
subheading += append_spaces(' ' * (len(x_axis_output_label) - 1), col_width)
|
||||
for header in metric_headers:
|
||||
heading += append_spaces(header, col_width * len(percentile_subheaders))
|
||||
for percentile in percentile_subheaders:
|
||||
@ -163,7 +163,7 @@ def print_benchmark_results(report):
|
||||
|
||||
def main():
|
||||
r"""
|
||||
Runs rpc benchmark once if no argument has multiple entries, and otherwise once for each of the multiple entries.
|
||||
Runs rpc benchmark once if no argument has multiple entries, and otherwise once for each of the multiple entries.
|
||||
Multiple entries is indicated by comma separated values, and may only be done for a single argument.
|
||||
Results are printed as well as saved to output file. In case of multiple entries for a single argument,
|
||||
the plot repo can be used to benchmark results on the y axis with each entry on the x axis.
|
||||
@ -171,7 +171,7 @@ def main():
|
||||
find_graph_variable(args)
|
||||
|
||||
# run once if no x axis variables
|
||||
x_axis_variables = args[args['x_axis_name']] if args.get('x_axis_name') else [None]
|
||||
x_axis_variables = args[args['x_axis_name']] if args.get('x_axis_name') else [None]
|
||||
ctx = mp.get_context('spawn')
|
||||
queue = ctx.SimpleQueue()
|
||||
benchmark_runs = []
|
||||
@ -197,7 +197,7 @@ def main():
|
||||
print(f"Time taken benchmark run {i} -, {time.time() - start_time}")
|
||||
if args.get('x_axis_name'):
|
||||
# save x axis value was for this iteration in the results
|
||||
benchmark_run_results[args['x_axis_name']] = x_axis_variable
|
||||
benchmark_run_results[args['x_axis_name']] = x_axis_variable
|
||||
benchmark_runs.append(benchmark_run_results)
|
||||
|
||||
report = args
|
||||
|
@ -1,6 +1,6 @@
|
||||
# Fast RNN benchmarks
|
||||
|
||||
Benchmarks for TorchScript models
|
||||
Benchmarks for TorchScript models
|
||||
|
||||
For most stable results, do the following:
|
||||
- Set CPU Governor to performance mode (as opposed to energy save)
|
||||
@ -24,7 +24,7 @@ or run the test independently:
|
||||
|
||||
should give a good comparison, or you can specify the type of model to run
|
||||
|
||||
`python -m fastrnns.bench --rnns cudnn aten jit --group rnns`
|
||||
`python -m fastrnns.bench --rnns cudnn aten jit --group rnns`
|
||||
|
||||
## Run model profiling, calls nvprof
|
||||
|
||||
@ -33,7 +33,7 @@ should give a good comparison, or you can specify the type of model to run
|
||||
should generate nvprof file for all models somewhere.
|
||||
you can also specify the models to generate nvprof files separately:
|
||||
|
||||
`python -m fastrnns.profile --rnns aten jit`
|
||||
`python -m fastrnns.profile --rnns aten jit`
|
||||
|
||||
### Caveats
|
||||
|
||||
|
@ -1,16 +1,16 @@
|
||||
import operator_benchmark as op_bench
|
||||
import benchmark_caffe2 as op_bench_c2
|
||||
from benchmark_caffe2 import Caffe2BenchmarkBase # noqa
|
||||
from caffe2.python import core
|
||||
from caffe2.python import core
|
||||
|
||||
|
||||
"""Microbenchmarks for element-wise Add operator. Supports both Caffe2/PyTorch."""
|
||||
|
||||
# Configs for C2 add operator
|
||||
# Configs for C2 add operator
|
||||
add_long_configs = op_bench.cross_product_configs(
|
||||
M=[8, 64, 128],
|
||||
N=range(2, 10, 3),
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
dtype=["int", "float"],
|
||||
tags=["long"]
|
||||
)
|
||||
@ -22,20 +22,20 @@ add_short_configs = op_bench.config_list(
|
||||
[16, 16, 64, "float"],
|
||||
[64, 64, 128, "int"],
|
||||
],
|
||||
attr_names=["M", "N", "K", "dtype"],
|
||||
tags=["short"],
|
||||
attr_names=["M", "N", "K", "dtype"],
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
class AddBenchmark(op_bench_c2.Caffe2BenchmarkBase):
|
||||
def init(self, M, N, K, dtype):
|
||||
self.input_one = self.tensor([M, N, K], dtype)
|
||||
self.input_two = self.tensor([M, N, K], dtype)
|
||||
def init(self, M, N, K, dtype):
|
||||
self.input_one = self.tensor([M, N, K], dtype)
|
||||
self.input_two = self.tensor([M, N, K], dtype)
|
||||
self.output = self.tensor([M, N, K], dtype)
|
||||
self.set_module_name("add")
|
||||
|
||||
def forward(self):
|
||||
op = core.CreateOperator(
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
)
|
||||
return op
|
||||
|
||||
|
@ -2,7 +2,7 @@
|
||||
import operator_benchmark as op_bench
|
||||
import benchmark_caffe2 as op_bench_c2
|
||||
from benchmark_caffe2 import Caffe2BenchmarkBase # noqa
|
||||
from caffe2.python import core
|
||||
from caffe2.python import core
|
||||
|
||||
"""Microbenchmarks for MatMul operator"""
|
||||
|
||||
@ -10,7 +10,7 @@ from caffe2.python import core
|
||||
mm_long_configs = op_bench.cross_product_configs(
|
||||
M=[8, 64, 128],
|
||||
N=range(2, 10, 3),
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
trans_a=[True, False],
|
||||
trans_b=[True, False],
|
||||
tags=["long"]
|
||||
@ -23,13 +23,13 @@ mm_short_configs = op_bench.config_list(
|
||||
[1024, 1024, 256, True, False],
|
||||
[8192, 8192, 1024, True, False],
|
||||
],
|
||||
attr_names=["M", "N", "K", "trans_a", "trans_b"],
|
||||
tags=["short"],
|
||||
attr_names=["M", "N", "K", "trans_a", "trans_b"],
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
|
||||
class MatMulBenchmark(op_bench_c2.Caffe2BenchmarkBase):
|
||||
def init(self, M, N, K, trans_a, trans_b):
|
||||
def init(self, M, N, K, trans_a, trans_b):
|
||||
self.input_one = self.tensor([N, M]) if trans_a else self.tensor([M, N])
|
||||
self.input_two = self.tensor([K, N]) if trans_b else self.tensor([N, K])
|
||||
self.args = {'trans_a': trans_a, 'trans_b': trans_b}
|
||||
@ -38,7 +38,7 @@ class MatMulBenchmark(op_bench_c2.Caffe2BenchmarkBase):
|
||||
|
||||
def forward(self):
|
||||
op = core.CreateOperator(
|
||||
"MatMul", [self.input_one, self.input_two], self.output, **self.args
|
||||
"MatMul", [self.input_one, self.input_two], self.output, **self.args
|
||||
)
|
||||
return op
|
||||
|
||||
|
@ -22,7 +22,7 @@ unary_ops_list = op_bench.op_list(
|
||||
|
||||
|
||||
class UnaryOpBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, op_func):
|
||||
def init(self, M, N, op_func):
|
||||
self.input_one = torch.rand(M, N)
|
||||
self.op_func = op_func
|
||||
|
||||
|
@ -1,5 +1,5 @@
|
||||
import operator_benchmark as op_bench
|
||||
from caffe2.python import core
|
||||
from caffe2.python import core
|
||||
|
||||
|
||||
add_configs = op_bench.cross_product_configs(
|
||||
@ -11,24 +11,24 @@ add_configs = op_bench.cross_product_configs(
|
||||
)
|
||||
|
||||
class AddBenchmark(op_bench.Caffe2BenchmarkBase):
|
||||
def init(self, M, N, K, device):
|
||||
def init(self, M, N, K, device):
|
||||
self.set_module_name("add")
|
||||
self.input_one = self.tensor([M, N, K], device=device)
|
||||
self.input_two = self.tensor([M, N, K], device=device)
|
||||
self.input_one_grad = self.tensor([M, N, K], device=device)
|
||||
self.input_two_grad = self.tensor([M, N, K], device=device)
|
||||
self.input_one = self.tensor([M, N, K], device=device)
|
||||
self.input_two = self.tensor([M, N, K], device=device)
|
||||
self.input_one_grad = self.tensor([M, N, K], device=device)
|
||||
self.input_two_grad = self.tensor([M, N, K], device=device)
|
||||
self.output = self.tensor([M, N, K], device=device)
|
||||
|
||||
def forward(self):
|
||||
op = core.CreateOperator(
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
)
|
||||
return op
|
||||
|
||||
def backward(self):
|
||||
grad_op = core.CreateOperator(
|
||||
"AddGradient", [self.output, self.input_one, self.input_two],
|
||||
[self.input_one_grad, self.input_two_grad], **self.args
|
||||
"AddGradient", [self.output, self.input_one, self.input_two],
|
||||
[self.input_one_grad, self.input_two_grad], **self.args
|
||||
)
|
||||
return grad_op
|
||||
|
||||
|
@ -5,8 +5,8 @@ intraop_bench_configs = op_bench.config_list(
|
||||
attrs=[
|
||||
[8, 16],
|
||||
],
|
||||
attr_names=["M", "N"],
|
||||
tags=["short"],
|
||||
attr_names=["M", "N"],
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
@torch.jit.script
|
||||
@ -24,9 +24,9 @@ class TorchSumBenchmark(op_bench.TorchBenchmarkBase):
|
||||
self.input_one = torch.rand(M, N)
|
||||
self.set_module_name("sum")
|
||||
|
||||
# This is a very temporary method and will be removed soon, so
|
||||
# This is a very temporary method and will be removed soon, so
|
||||
# don't use this method in your benchmark
|
||||
# TODO(mingzhe): use one forward method for both JIT and Eager
|
||||
# TODO(mingzhe): use one forward method for both JIT and Eager
|
||||
def jit_forward(self, iters):
|
||||
return torch_sumall(self.input_one, iters)
|
||||
|
||||
|
@ -10,9 +10,9 @@ add_configs = op_bench.cross_product_configs(
|
||||
)
|
||||
|
||||
# This benchmark uses the auto_set to automatically set requires_grad
|
||||
# for both inputs. The test name can also be used for filtering.
|
||||
# for both inputs. The test name can also be used for filtering.
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K):
|
||||
def init(self, M, N, K):
|
||||
self.input_one = torch.rand(M, N, K, requires_grad=self.auto_set())
|
||||
self.input_two = torch.rand(M, N, K, requires_grad=self.auto_set())
|
||||
self.set_module_name("add")
|
||||
|
@ -4,7 +4,7 @@ import torch
|
||||
"""Microbenchmarks for element-wise Add operator. Supports both Caffe2/PyTorch."""
|
||||
|
||||
add_short_configs = op_bench.config_list(
|
||||
attr_names=['M', 'N', 'K'],
|
||||
attr_names=['M', 'N', 'K'],
|
||||
attrs=[
|
||||
[8, 16, 32],
|
||||
[16, 16, 64],
|
||||
@ -14,12 +14,12 @@ add_short_configs = op_bench.config_list(
|
||||
'device': ['cpu', 'cuda'],
|
||||
'dtype': [torch.float, torch.float64],
|
||||
},
|
||||
tags=['short'],
|
||||
tags=['short'],
|
||||
)
|
||||
|
||||
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K, device, dtype):
|
||||
def init(self, M, N, K, device, dtype):
|
||||
self.input_one = torch.rand(M, N, K, device=device, dtype=dtype, requires_grad=True)
|
||||
self.input_two = torch.rand(M, N, K, device=device, dtype=dtype)
|
||||
self.set_module_name('add')
|
||||
|
@ -12,7 +12,7 @@ add_configs = op_bench.cross_product_configs(
|
||||
|
||||
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K, device):
|
||||
def init(self, M, N, K, device):
|
||||
self.input_one = torch.rand(M, N, K, device=device, requires_grad=True)
|
||||
self.input_two = torch.rand(M, N, K, device=device, requires_grad=True)
|
||||
self.set_module_name("add")
|
||||
|
@ -7,19 +7,19 @@ configs = op_bench.random_sample_configs(
|
||||
N=[7, 8, 9, 10, 11, 12],
|
||||
K=[13, 14, 15, 16, 17, 18],
|
||||
# probs saves the weights of each value
|
||||
probs=op_bench.attr_probs(
|
||||
probs=op_bench.attr_probs(
|
||||
M=[0.5, 0.2, 0.1, 0.05, 0.03, 0.1],
|
||||
N=[0.1, 0.3, 0.4, 0.02, 0.03, 0.04],
|
||||
K=[0.03, 0.6, 0.04, 0.02, 0.03, 0.01],
|
||||
),
|
||||
# this is the number of returned inputs
|
||||
total_samples=10,
|
||||
# this is the number of returned inputs
|
||||
total_samples=10,
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K):
|
||||
def init(self, M, N, K):
|
||||
self.input_one = torch.rand(M, N, K)
|
||||
self.input_two = torch.rand(M, N, K)
|
||||
self.set_module_name("add")
|
||||
|
@ -1,9 +1,9 @@
|
||||
# Sparse benchmarks
|
||||
|
||||
# These benchmarks are for the sparse matrix functionality.
|
||||
# These benchmarks are for the sparse matrix functionality.
|
||||
# They exist for comparing the performance of sparse matrix routines
|
||||
# torch.sparse.mm(sparse, sparse)` with different backends (CPU/CUDA)
|
||||
# and with other frameworks such as scipy.
|
||||
# and with other frameworks such as scipy.
|
||||
|
||||
import sys
|
||||
from scipy import sparse
|
||||
|
@ -3,9 +3,9 @@
|
||||
DATASET_ROOT_DIR=$HOME/datasets/
|
||||
|
||||
# wget https://storage.googleapis.com/sgk-sc2020/dlmc.tar.gz -P $DATASET_ROOT_DIR
|
||||
# tar -xvf $DATASET_ROOT_DIR/dlmc.tar.gz
|
||||
# tar -xvf $DATASET_ROOT_DIR/dlmc.tar.gz
|
||||
|
||||
echo "!! SPARSE SPMS TIME BENCHMARK!! "
|
||||
echo "!! SPARSE SPMS TIME BENCHMARK!! "
|
||||
|
||||
python matmul_dlmc_bench.py --path $DATASET_ROOT_DIR/dlmc/rn50 --dataset random_pruning --operation matmul --output /tmp/matmul_bench.pkl
|
||||
python matmul_dlmc_bench.py --path $DATASET_ROOT_DIR/dlmc/rn50 --dataset random_pruning --operation backward --output /tmp/backward_bench.pkl
|
||||
|
Some files were not shown because too many files have changed in this diff.
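All of the hunks shown here, and those in the files elided by the viewer, are instances of the same mechanical cleanup. As a rough illustration only (a minimal sketch, not the command that actually produced this commit), the per-file transformation amounts to the following:

```
#!/usr/bin/env python3
# Minimal sketch of the cleanup this commit applies: rewrite each file given on
# the command line with trailing spaces/tabs removed. This is an illustration,
# not the tool that generated the diff.
import sys


def strip_trailing_whitespace(path: str) -> bool:
    """Return True if `path` contained trailing whitespace and was rewritten."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    # line.rstrip() also removes the trailing newline, so re-append it afterwards.
    cleaned = [
        line.rstrip() + "\n" if line.endswith("\n") else line.rstrip()
        for line in lines
    ]
    if cleaned == lines:
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(cleaned)
    return True


if __name__ == "__main__":
    for path in sys.argv[1:]:
        if strip_trailing_whitespace(path):
            print(f"stripped trailing whitespace from {path}")
```

A CI check, by contrast, only needs to detect offending lines and fail; it does not have to rewrite them in place.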
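One aside on the generator-test hunks listed earlier: the property that TestRNGForking checks, namely that a generator's state can be frozen and later restored so the same samples are reproduced, looks roughly like this at the Python level (an illustrative sketch, not the C++ test itself):

```
import torch

# Sketch of the "freeze and restore" behavior the CPU/CUDA generator tests verify.
torch.manual_seed(0)

state = torch.get_rng_state()  # snapshot the default CPU generator's state
expected = torch.rand(3)       # draw some samples

torch.set_rng_state(state)     # restore the snapshot
forked = torch.rand(3)         # the identical samples are drawn again

assert torch.equal(expected, forked)
```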