Add the OpenMP optimization for BatchPermutation. (#12153)

Summary:
This is a Caffe2 performance optimization.
With this change, the following op speeds up significantly (measured with MaskRCNN on one SKX8180 socket):
BatchPermutation op: reduced from 8.296387 ms to 1.4501984 ms.
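
For context, a minimal standalone sketch of the pattern this change applies, assuming a row-major layout where each of the N items occupies D contiguous floats. The function and parameter names below are hypothetical and not the op's actual signature; they only illustrate the guarded pragma.

// Minimal sketch (hypothetical standalone function, not the op's real code):
// gather N rows of length D from src into dst according to indices,
// using the same OpenMP-version guard as the change below.
void batch_permutation_sketch(
    const float* src, const int* indices, float* dst, int N, int D) {
#ifdef _OPENMP
#if (_OPENMP >= 201307)
#pragma omp parallel for simd
#else
#pragma omp parallel for
#endif
#endif
  for (int i = 0; i < N; i++) {
    // Row i of the output is a copy of row indices[i] of the input.
    const int idx = indices[i];
    for (int j = 0; j < D; j++) {
      dst[i * D + j] = src[idx * D + j];
    }
  }
}

Each of the N row copies is independent of the others, which is what makes the outer loop safe to parallelize.
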
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12153

Differential Revision: D10362823

Pulled By: ezyang

fbshipit-source-id: 04d1486f6c7db49270992cd8cde41092154e62ee
Author: ChongyuIntel
Date: 2018-10-16 20:20:42 -07:00
Committed by: Facebook Github Bot
Parent: 3709734b1c
Commit: 5416260b1e

@@ -100,6 +100,13 @@ bool BatchPermutationOp<float, CPUContext>::RunOnDevice() {
   const float *src = X.template data<float>();
   float *dst = Y->template mutable_data<float>();
+#ifdef _OPENMP
+#if (_OPENMP >= 201307)
+#pragma omp parallel for simd
+#else
+#pragma omp parallel for
+#endif
+#endif
   for (int i = 0; i < N; i++) {
     int idx = indices.template data<int>()[i];
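
For reference, _OPENMP expands to the year/month of the OpenMP specification the compiler implements; 201307 is OpenMP 4.0, the first version to define the simd construct, so the guard falls back to a plain parallel for on older compilers. The directives only take effect when the build enables OpenMP (for example -fopenmp with GCC or Clang); otherwise the pragmas are ignored and the loop stays serial.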