Add the OpenMP optimization for BatchPermutation. (#12153)

Summary:
This is a Caffe2 performance optimization.
With this change, the following op speeds up significantly (measured with MaskRCNN on one SKX8180 socket):
BatchPermutation op: reduced from 8.296387 ms to 1.4501984 ms.
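
For context, a minimal standalone sketch of the pattern this change applies, assuming a row-major layout where each of the N items occupies D contiguous floats. The function and parameter names below are hypothetical and not the op's actual signature; they only illustrate the guarded pragma.

// Minimal sketch (hypothetical standalone function, not the op's real code):
// gather N rows of length D from src into dst according to indices,
// using the same OpenMP-version guard as the change below.
void batch_permutation_sketch(
    const float* src, const int* indices, float* dst, int N, int D) {
#ifdef _OPENMP
#if (_OPENMP >= 201307)
#pragma omp parallel for simd
#else
#pragma omp parallel for
#endif
#endif
  for (int i = 0; i < N; i++) {
    // Row i of the output is a copy of row indices[i] of the input.
    const int idx = indices[i];
    for (int j = 0; j < D; j++) {
      dst[i * D + j] = src[idx * D + j];
    }
  }
}

Each of the N row copies is independent of the others, which is what makes the outer loop safe to parallelize.
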
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12153

Differential Revision: D10362823

Pulled By: ezyang

fbshipit-source-id: 04d1486f6c7db49270992cd8cde41092154e62ee
Author: ChongyuIntel
Date: 2018-10-16 20:20:42 -07:00
Committed by: Facebook Github Bot
Parent: 3709734b1c
Commit: 5416260b1e

@@ -100,6 +100,13 @@ bool BatchPermutationOp<float, CPUContext>::RunOnDevice() {
   const float *src = X.template data<float>();
   float *dst = Y->template mutable_data<float>();
+#ifdef _OPENMP
+#if (_OPENMP >= 201307)
+#pragma omp parallel for simd
+#else
+#pragma omp parallel for
+#endif
+#endif
   for (int i = 0; i < N; i++) {
     int idx = indices.template data<int>()[i];
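
For reference, _OPENMP expands to the year/month of the OpenMP specification the compiler implements; 201307 is OpenMP 4.0, the first version to define the simd construct, so the guard falls back to a plain parallel for on older compilers. The directives only take effect when the build enables OpenMP (for example -fopenmp with GCC or Clang); otherwise the pragmas are ignored and the loop stays serial.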