15 Commits

SHA1 Message Date
e1e6417d4c Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a Python codegen file, similar to `caffe2/perfkernels/hp_emblookup_codegen.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-15 18:52:44 +00:00
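
A minimal sketch of the kind of SVE kernel such a codegen script emits, assuming a simple sum-pooling signature (the function name and arguments here are hypothetical; the real kernels are produced by the PR's codegen file):

```cpp
#include <arm_sve.h>
#include <cstdint>

// Hypothetical hand-written equivalent of one generated SVE kernel:
// sum the rows of `weights` selected by `indices` into `out`.
void embedding_lookup_sum_sve(
    const float* weights,    // [num_rows, block_size] embedding table
    const int64_t* indices,  // [num_indices] rows to gather
    int64_t num_indices,
    int64_t block_size,
    float* out) {            // [block_size] accumulated result
  for (int64_t d = 0; d < block_size; ++d) {
    out[d] = 0.f;
  }
  for (int64_t i = 0; i < num_indices; ++i) {
    const float* row = weights + indices[i] * block_size;
    // Predicated SVE loop: svwhilelt produces a partial predicate for
    // the tail, so no scalar remainder loop is needed.
    for (int64_t d = 0; d < block_size; d += (int64_t)svcntw()) {
      svbool_t pg = svwhilelt_b32((uint64_t)d, (uint64_t)block_size);
      svfloat32_t acc = svld1_f32(pg, out + d);
      svfloat32_t v = svld1_f32(pg, row + d);
      svst1_f32(pg, out + d, svadd_f32_x(pg, acc, v));
    }
  }
}
```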
dac0b4e62b Revert "Add SVE implementation of embedding_lookup_idx (#133995)"
This reverts commit 770c134998d3422bc2fa3b90baa235ed0c409e62.

Reverted https://github.com/pytorch/pytorch/pull/133995 on behalf of https://github.com/clee2000 due to breaking internal tests; I'm wondering if this just needs a targets change for Buck? ([comment](https://github.com/pytorch/pytorch/pull/133995#issuecomment-2414596554))
2024-10-15 17:23:50 +00:00
770c134998 Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a Python codegen file, similar to `caffe2/perfkernels/hp_emblookup_codegen.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-14 10:17:27 +00:00
2e4c89eba9 [torch] Unify batch_box_cox implementations into perfkernels folder (#86569)
Summary:
1) Adds an MKL/AVX2-based implementation to perfkernels. This implementation is similar to caffe2/operators/batch_box_cox_op.cc
2) Migrates caffe2's batch_box_cox_op to use this implementation

Test Plan: CI

Differential Revision: D40208074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86569
Approved by: https://github.com/hyuen
2022-10-23 19:29:25 +00:00
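
For reference, the transform this kernel computes is the two-parameter Box-Cox used by the caffe2 operator; a scalar sketch follows (the MKL/AVX2 version vectorizes the pow/log over whole rows):

```cpp
#include <cmath>
#include <cstdint>

// Scalar reference for the batched Box-Cox transform:
//   y = ((x + lambda2)^lambda1 - 1) / lambda1   if lambda1 != 0
//   y = ln(x + lambda2)                         otherwise
// lambda1/lambda2 are per-column parameters.
void batch_box_cox_ref(
    int64_t rows, int64_t cols,
    const float* data,     // [rows, cols]
    const float* lambda1,  // [cols]
    const float* lambda2,  // [cols]
    float* out) {          // [rows, cols]
  for (int64_t i = 0; i < rows; ++i) {
    for (int64_t j = 0; j < cols; ++j) {
      const float x = data[i * cols + j] + lambda2[j];
      out[i * cols + j] = (lambda1[j] == 0.f)
          ? std::log(x)
          : (std::pow(x, lambda1[j]) - 1.f) / lambda1[j];
    }
  }
}
```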
06d1be2447 [NOOP][clangformat][codemod] Enable CLANGFORMAT for caffe2/caffe2/* (#67624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67624

Test Plan: Visual inspection. Sandcastle.

Reviewed By: malfet

Differential Revision: D31986628

fbshipit-source-id: c872bded7325997a2945dbf5d4d052628dcb3659
2021-11-02 22:14:04 -07:00
7576cf8d00 [caffe2] Use cpuinfo in perfkernels to simplify build dependency (#36371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36371

It allows us to drop a circular dependency and remove unknown_symbols in the Buck build.

It'd be good to get rid of GetCpuId altogether in favor of cpuinfo, but that's not really blocking anything.

Reviewed By: malfet

Differential Revision: D20958000

fbshipit-source-id: ed17a2a90a51dc1adf9e634af56c85f0689f8f29
2020-04-10 13:26:34 -07:00
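
The cpuinfo API used for this kind of runtime check is small; a minimal sketch:

```cpp
#include <cpuinfo.h>
#include <cstdio>

// Query host CPU features with cpuinfo before dispatching to a kernel.
int main() {
  if (!cpuinfo_initialize()) {
    std::fprintf(stderr, "cpuinfo initialization failed\n");
    return 1;
  }
  std::printf("AVX2:    %s\n", cpuinfo_has_x86_avx2() ? "yes" : "no");
  std::printf("AVX512F: %s\n", cpuinfo_has_x86_avx512f() ? "yes" : "no");
  return 0;
}
```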
e372f42110 [caffe2] Explicit vectorization of LSTM operator (#35556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35542

Applies explicit vectorization to the lstm_unit operator.
Enabled by -DENABLE_VECTORIZATION=1

This optimization requires vector library support and was tested with Intel SVML & clang. However, any compiler that supports OpenMP 4.5 with the omp simd extension should also benefit.

After the code changes:

In file included from caffe2/caffe2/operators/lstm_unit_op.cc:1:
caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {

caffe2/caffe2/operators/lstm_unit_op.h:112:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {

Test Plan:
Checked failures in OSS CI:
- No build failures related to this change
- Failing tests:
  - py3.6-clang7-rocmdeb-ubuntu16.04-test2
    > RuntimeError: fft: ATen not compiled with MKL support
  - caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test
    > gradient_check_test.py::TestMakeTwo
    > Exited with code exit status 1
  - pytorch_macos_10_13_py3_test, with test errors like:
    > ERROR [0.014s]: test_boolean_indexing_weirdness_cpu (__main__.NumpyTestsCPU)
    > RuntimeError: shape mismatch: indexing tensors could not be broadcast together with shapes [0], [2]
  - caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test
    - No failure info

Reviewed By: jspark1105

Differential Revision: D20484640

fbshipit-source-id: 8fb82dbd6698c8de3e0bbbc0b48d15b70e36ca94
2020-04-01 17:19:56 -07:00
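
A plausible definition of the `VECTOR_LOOP` macro seen in the remarks above, gated on the build flag (a sketch, not necessarily the exact macro in the tree):

```cpp
// With clang, `#pragma clang loop vectorize(enable)` requests
// vectorization of the following loop; an OpenMP 4.5 build could use
// _Pragma("omp simd") instead.
#if defined(ENABLE_VECTORIZATION)
#define VECTOR_LOOP _Pragma("clang loop vectorize(enable)")
#else
#define VECTOR_LOOP
#endif

// Usage, as in lstm_unit_op.h:
//   VECTOR_LOOP for (int d = 0; d < D; ++d) { ... }
```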
55511004d1 Resolve errors in perfkernel for Windows (#16031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16031

1. MSVC only has _mm_prefetch(const char*, int). Fixed in both the Python codegen and the C++ files.
2. uint32_t in "cvtsh_ss_bugfix.h" requires "#include <cstdint>".
3. Some files use gflags headers. Add the dependency via c10.
4. Isolate arch flags with an interface library and private compile options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15753

Reviewed By: dskhudia

Differential Revision: D13636233

Pulled By: jspark1105

fbshipit-source-id: cdcbd4240e07b749554a2a5676c11af88f23c31d
2019-01-16 21:51:00 -08:00
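
Item 1 boils down to a cast; a sketch of a portable wrapper:

```cpp
#include <xmmintrin.h>

// MSVC's _mm_prefetch takes a const char*, so cast explicitly; this
// compiles on MSVC, gcc, and clang alike.
template <typename T>
inline void prefetch_t0(const T* p) {
  _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
}
```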
1e0eab5df8 minimize header file includes from _avx2.cc (#14950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14950

Minimize the number of headers included from _avx2.cc files, to avoid accidentally compiling functions that are defined in header files and reused by other translation units, which can lead to illegal-instruction errors.

Reviewed By: dskhudia

Differential Revision: D13394483

fbshipit-source-id: 67149a6fb51f7f047e745bfe395cb6dd4ae7c1ae
2018-12-13 00:18:11 -08:00
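
A sketch of the hazard this guards against (file names hypothetical):

```cpp
// common.h -- a shared header with an inline function definition.
inline float scaled_sum(const float* x, int n, float s) {
  float acc = 0.f;
  for (int i = 0; i < n; ++i) acc += s * x[i];
  return acc;
}

// If foo_avx2.cc (compiled with -mavx2) includes common.h, the compiler
// may emit AVX2 instructions for scaled_sum in that translation unit.
// The linker keeps only one copy of an inline function, and if the AVX2
// copy wins, callers from plain translation units hit SIGILL on CPUs
// without AVX2. Minimizing includes in *_avx2.cc avoids emitting such
// shared symbols with AVX2 code.
```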
0573ef664e include avx512vl to avx512 code path (#14733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14733

We often also want to use the AVX512VL instruction set.
We already include AVX512F and AVX512DQ.
Skylake also has AVX512BW and AVX512CD, which we may want to add later.

Reviewed By: duc0

Differential Revision: D13317282

fbshipit-source-id: 82c8e401d82d5c3a5452fb4ccb6e5cb88d242bda
2018-12-05 00:50:51 -08:00
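
A sketch of gating code on the extra feature macro (with compile flags along the lines of -mavx512f -mavx512dq -mavx512vl):

```cpp
#if defined(__AVX512F__) && defined(__AVX512DQ__) && defined(__AVX512VL__)
#include <immintrin.h>

// AVX512VL extends the 512-bit instruction forms to 128/256-bit
// vectors, e.g. masked loads on __m256:
static inline __m256 masked_load8(const float* p, __mmask8 m) {
  return _mm256_maskz_loadu_ps(m, p);
}
#endif
```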
b5181ba1df add avx512 option (but no avx512 kernel yet) (#14664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14664

This diff just adds a framework for adding avx512 kernels.
Please be really careful about using avx512 kernels: unless you're convinced they bring good enough *overall* speedups, they can backfire because the CPU frequency goes down.

Reviewed By: duc0

Differential Revision: D13281944

fbshipit-source-id: 04fce8619c63f814944b727a99fbd7d35538eac6
2018-12-03 12:18:19 -08:00
4b86a215ca moving simd adagrad code to perfkernels (#13549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13549

caffe2/perfkernels has a nice framework for switching between implementations optimized for different instruction sets at runtime.
This is good preparation for implementing avx512 Adagrad kernels.

Reviewed By: hyuen

Differential Revision: D12882872

fbshipit-source-id: a8f0419f6a9fd4e9b864c454dad0a80db267190c
2018-11-11 00:20:39 -08:00
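
A sketch of the perfkernels-style runtime dispatch this message refers to, with hypothetical names; feature detection is shown via cpuinfo for brevity, and both variants are scalar here only to keep the sketch self-contained (in the real tree the AVX2 variant lives in a separate _avx2.cc compiled with AVX2 flags):

```cpp
#include <cmath>
#include <cpuinfo.h>

// Scalar Adagrad update: h += g^2; w -= lr * g / (sqrt(h) + eps).
static void adagrad_update__base(int n, float* w, float* h, const float* g,
                                 float eps, float lr) {
  for (int i = 0; i < n; ++i) {
    h[i] += g[i] * g[i];
    w[i] -= lr * g[i] / (std::sqrt(h[i]) + eps);
  }
}

// Stand-in for the variant that would be compiled with -mavx2.
static void adagrad_update__avx2(int n, float* w, float* h, const float* g,
                                 float eps, float lr) {
  adagrad_update__base(n, w, h, g, eps, lr);
}

// Public entry point: pick an implementation once, on first call.
void adagrad_update(int n, float* w, float* h, const float* g,
                    float eps, float lr) {
  using Fn = void (*)(int, float*, float*, const float*, float, float);
  static const Fn fn = (cpuinfo_initialize() && cpuinfo_has_x86_avx2())
                           ? adagrad_update__avx2
                           : adagrad_update__base;
  fn(n, w, h, g, eps, lr);
}
```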
1d5780d42c Remove Apache headers from source.
* The LICENSE file contains the details, so removing the headers from individual source files.
2018-03-27 13:10:18 -07:00
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
eccddbc204 vectorized typed axpy implementation
Summary:
This adds an example of a vectorized typed axpy implementation under perfkernels.

Reviewed By: dzhulgakov

Differential Revision: D5479258

fbshipit-source-id: 469e6c8aaf2c12cdf0025bc867eb9d4cab84184f
2017-07-25 12:08:27 -07:00
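
For context, axpy computes y += a*x; "typed" here means the input and output element types may differ. A scalar sketch of the shape (hypothetical template; the perfkernels version adds runtime-dispatched vectorized variants):

```cpp
#include <cstdint>

// y[i] += a * x[i], converting each input element to the output type,
// e.g. accumulating a lower-precision table into float.
template <typename TIn, typename TOut>
void typed_axpy(int64_t n, TOut a, const TIn* x, TOut* y) {
  for (int64_t i = 0; i < n; ++i) {
    y[i] += a * static_cast<TOut>(x[i]);
  }
}
```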