Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14733
We often also want to use the AVX512VL instruction set.
We already included AVX512F and AVX512DQ.
Skylake also has AVX512BW and AVX512CD, which we may want to add later.
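For illustration only (a hedged sketch, not code from this diff), the 128/256-bit encodings of AVX512 masked operations are exactly what AVX512VL adds on top of AVX512F; the function name below is hypothetical.
```cpp
#include <cstdint>
#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>

// Masked 256-bit add: the 256-bit forms of AVX512 masked ops require
// AVX512VL in addition to AVX512F.
void masked_add_f32(const float* a, const float* b, float* out, int64_t n) {
  for (int64_t i = 0; i < n; i += 8) {
    // Mask off the tail so the final partial vector never reads out of bounds.
    __mmask8 m =
        (n - i >= 8) ? (__mmask8)0xFF : (__mmask8)((1u << (n - i)) - 1);
    __m256 va = _mm256_maskz_loadu_ps(m, a + i);
    __m256 vb = _mm256_maskz_loadu_ps(m, b + i);
    _mm256_mask_storeu_ps(out + i, m, _mm256_add_ps(va, vb));
  }
}
#endif
```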
Reviewed By: duc0
Differential Revision: D13317282
fbshipit-source-id: 82c8e401d82d5c3a5452fb4ccb6e5cb88d242bda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14664
This diff just adds a framework for adding AVX512 kernels.
Please be really careful about using AVX512 kernels: only use them if you're convinced they will bring a good enough *overall* speedup, because AVX512 can backfire by lowering the CPU frequency.
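As a rough sketch of what such a framework looks like (illustrative names, not this diff's actual macros), the generic entry point picks a variant at runtime and falls back when AVX512 is unavailable or not worthwhile.
```cpp
#include <cstdint>

// In a real perfkernels-style setup each variant lives in its own translation
// unit compiled with different -m flags; they are stubbed here so the sketch
// compiles standalone.
static void axpy__base(int64_t n, float a, const float* x, float* y) {
  for (int64_t i = 0; i < n; ++i) y[i] += a * x[i];
}
static void axpy__avx2(int64_t n, float a, const float* x, float* y) {
  axpy__base(n, a, x, y);  // placeholder for the AVX2 implementation
}
static void axpy__avx512(int64_t n, float a, const float* x, float* y) {
  axpy__base(n, a, x, y);  // placeholder for the AVX512 implementation
}

void axpy(int64_t n, float a, const float* x, float* y) {
#if defined(__GNUC__) || defined(__clang__)
  // Runtime CPU check; only route to AVX512 when it is expected to be a net
  // win for the whole workload (downclocking can erase local gains).
  if (__builtin_cpu_supports("avx512f")) { axpy__avx512(n, a, x, y); return; }
  if (__builtin_cpu_supports("avx2"))    { axpy__avx2(n, a, x, y);   return; }
#endif
  axpy__base(n, a, x, y);
}
```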
Reviewed By: duc0
Differential Revision: D13281944
fbshipit-source-id: 04fce8619c63f814944b727a99fbd7d35538eac6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13549
caffe2/perfkernels has a nice framework for switching at runtime between implementations optimized for different instruction sets.
This is good preparation for implementing AVX512 Adagrad kernels.
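For context, a scalar reference of the Adagrad update that such AVX512 kernels would vectorize might look like the sketch below (names and signature are illustrative, not Caffe2's exact API).
```cpp
#include <cmath>
#include <cstdint>

// Scalar reference of the Adagrad update; an AVX512 perfkernel variant would
// vectorize this loop.
void adagrad_update__base(
    int64_t n,
    const float* w,   // current weights
    const float* g,   // gradients
    const float* h,   // accumulated squared gradients
    float* nw,        // updated weights (output)
    float* nh,        // updated squared-gradient history (output)
    float epsilon,
    float lr) {       // learning rate (typically negative in this convention)
  for (int64_t i = 0; i < n; ++i) {
    float gi = g[i];
    float hi = nh[i] = h[i] + gi * gi;                   // accumulate g^2
    nw[i] = w[i] + lr * gi / (std::sqrt(hi) + epsilon);  // scaled step
  }
}
```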
Reviewed By: hyuen
Differential Revision: D12882872
fbshipit-source-id: a8f0419f6a9fd4e9b864c454dad0a80db267190c
Summary:
This was used as a convenient way for us to convert c1 models. Now that conversion is more or less done, we should probably require any users who need to convert c1 models to explicitly install c1. This PR removes the explicit c1 proto (which was copied from c1) in favor of explicit installation.
Note that caffe_translator will still work properly; the only difference is that users now need to install c1 separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10380
Differential Revision: D9267981
Pulled By: Yangqing
fbshipit-source-id: a6ce5d9463e6567976da83f2d08b2c3d94d14390
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9350
Re-apply #9270
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. The fix is to isolate Eigen from headers that are included by .cu files and processed by nvcc. This was worked on with smessmer.
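A minimal sketch of the isolation pattern, with hypothetical file and function names (header and source shown together): the header included from .cu files exposes only plain declarations, while Eigen is included only in a .cc compiled by the host C++ compiler, so nvcc never sees Eigen's headers.
```cpp
// matmul_wrapper.h -- safe to include from .cu files processed by nvcc:
// no Eigen headers leak into this interface.
#pragma once
#include <cstdint>
void MatMulCPU(const float* a, const float* b, float* c,
               int64_t m, int64_t k, int64_t n);

// matmul_wrapper.cc -- compiled only by the host C++ compiler, so Eigen's
// heavily templated headers never pass through nvcc.
#include <Eigen/Core>
void MatMulCPU(const float* a, const float* b, float* c,
               int64_t m, int64_t k, int64_t n) {
  using InMat = Eigen::Map<
      const Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>>;
  using OutMat = Eigen::Map<
      Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>>;
  OutMat(c, m, n) = InMat(a, m, k) * InMat(b, k, n);
}
```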
Reviewed By: mingzhe09088
Differential Revision: D8794431
fbshipit-source-id: de656334af46c697802073f8e8d9a6aeb9ca65a7
Summary:
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. The fix is to isolate Eigen from headers that are included by .cu files and processed by nvcc. This was worked on with smessmer.
cc mingzhe09088 smessmer BIT-silence Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9270
Reviewed By: mingzhe09088
Differential Revision: D8768025
Pulled By: orionr
fbshipit-source-id: 5b34017aeb67e35a1b5938d962181ccd4cd37591
Summary:
Updates the perfkernel codebase to implement embedding lookup for our new fused storage format, where each row in the data matrix stores the quantized values *and* the scale and bias.
msmelyan, see this as my best-effort attempt at updating the perfkernel code for the fused storage. Let me know if any of this is grossly wrong. I also don't know whether we need to update any of the prefetching operations or anything like that.
Note that we have to keep the old code around for a bit until we get rid of the old operations with separate `scale_bias` storage.
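As an illustration, assuming the fused layout appends the scale and bias as two floats after each row's quantized bytes, decoding a single row could look like the sketch below; the exact layout is defined by the fused-storage operators, not by this snippet.
```cpp
#include <cstdint>
#include <cstring>

// Decodes one row of a fused 8-bit rowwise-quantized matrix, assuming the
// row width is block_size + 2 * sizeof(float): quantized bytes, then scale,
// then bias.
void decode_fused_row(const uint8_t* row, int block_size, float* out) {
  float scale, bias;
  std::memcpy(&scale, row + block_size, sizeof(float));
  std::memcpy(&bias, row + block_size + sizeof(float), sizeof(float));
  for (int j = 0; j < block_size; ++j) {
    out[j] = row[j] * scale + bias;  // dequantize: x = q * scale + bias
  }
}
```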
Reviewed By: kennyhorror
Differential Revision: D6710843
fbshipit-source-id: b485ef2389f526c5db1260cac9d4be3fc8df0979
Summary: 8 bytes is 64 bits. Fixes an out-of-range access caught by ASAN.
Reviewed By: Yangqing
Differential Revision: D6219576
fbshipit-source-id: f7c418b12fa211890abcb5aef800bd456390b73a
Summary: Previously, the boundary check happened only after the first access in the 8-bit ops.
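A hedged sketch of the fix pattern (hypothetical helper, not the actual operator code): validate the index against the table size before the first access.
```cpp
#include <cstdint>

// Check the index *before* touching (or prefetching) the row, not after.
bool gather_row(const float* data, int64_t num_rows, int64_t block_size,
                int64_t idx, float* out) {
  if (idx < 0 || idx >= num_rows) {
    return false;  // report out-of-range instead of reading past the table
  }
  const float* row = data + idx * block_size;
  for (int64_t j = 0; j < block_size; ++j) out[j] = row[j];
  return true;
}
```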
Reviewed By: Yangqing
Differential Revision: D6206753
fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for the old code (sparse_lengths_sum_benchmark.old.par), which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and for our new code, optimized via the code generator
(sparse_lengths_sum_benchmark.new.par). The block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the scale and add the bias.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire 64-byte cache line
while only using 8 bytes of it. A 128-wide uint8 input block occupies only 2 cache lines, so
reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insight into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8,
into a microbenchmark in which we varied the block size while keeping the table size constant (256MB):
block_size  time(uint8)  time(float16)  time(float32)
        64         0.19           0.09           0.17
       128         0.12           0.09           0.17
       256         0.70           0.09           0.14
      1024         0.50           0.06           0.10
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
the time to perform the embedding lookups decreases quite drastically. For a block_size of 256 and beyond, uint8 starts achieving
a speedup over float16. A longer block better amortizes the bandwidth wasted on scale_bias and on the hardware
prefetcher running past the end of the block.
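For reference, a scalar version of the 8-bit rowwise lengths-sum discussed above could look like the sketch below (argument names are illustrative); it makes the extra per-row scale_bias load, and hence the extra cache-line traffic, explicit.
```cpp
#include <cstdint>

// Scalar reference of an 8-bit rowwise lengths-sum with *separate* scale_bias
// storage: each gathered row also reads two floats from scale_bias, touching
// at least one additional cache line beyond the quantized data itself.
void sparse_lengths_sum_8bit__base(
    const uint8_t* data,      // [num_rows, block_size] quantized values
    const float* scale_bias,  // [num_rows, 2] = {scale, bias} per row
    const int64_t* indices,   // rows to gather
    const int* lengths,       // segment lengths
    int num_segments,
    int block_size,
    float* out) {             // [num_segments, block_size]
  int64_t pos = 0;
  for (int s = 0; s < num_segments; ++s) {
    float* o = out + s * block_size;
    for (int j = 0; j < block_size; ++j) o[j] = 0.f;
    for (int k = 0; k < lengths[s]; ++k) {
      int64_t idx = indices[pos++];
      float scale = scale_bias[2 * idx];
      float bias = scale_bias[2 * idx + 1];
      const uint8_t* row = data + idx * block_size;
      for (int j = 0; j < block_size; ++j) {
        o[j] += row[j] * scale + bias;  // dequantize-and-accumulate
      }
    }
  }
}
```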
Reviewed By: kennyhorror
Differential Revision: D5870907
fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for the old code (sparse_lengths_sum_benchmark.old.par), which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and for our new code, optimized via the code generator
(sparse_lengths_sum_benchmark.new.par). The block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the scale and add the bias.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire 64-byte cache line
while only using 8 bytes of it. A 128-wide uint8 input block occupies only 2 cache lines, so
reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insight into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8,
into a microbenchmark in which we varied the block size while keeping the table size constant (256MB):
block_size  time(uint8)  time(float16)  time(float32)
        64         0.19           0.09           0.17
       128         0.12           0.09           0.17
       256         0.70           0.09           0.14
      1024         0.50           0.06           0.10
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
the time to perform the embedding lookups decreases quite drastically. For a block_size of 256 and beyond, uint8 starts achieving
a speedup over float16. A longer block better amortizes the bandwidth wasted on scale_bias and on the hardware
prefetcher running past the end of the block.
Reviewed By: dzhulgakov
Differential Revision: D5824641
fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6
Summary:
Using file(WRITE) caused the file to be rewritten for every CMake
reconfigure, which was causing unnecessary full rebuilds of the project
even when no source files changed.
The new strategy has the added benefit of enforcing that the macros.h file
is always generated correctly. When the main project relies on this
header for macro definitions (instead of relying on add_definitions()),
we can be more confident that the project will build correctly when used
as a library (which is the whole point of the macros.h file).
Upsides:
* No more unnecessary rebuilds
* Higher confidence that the project will compile properly as a third-party library
Downsides:
* Developers need to add an entry to `macros.h.in` whenever they would have added a new definition with `add_definitions()`
Closes https://github.com/caffe2/caffe2/pull/1103
Differential Revision: D5680367
Pulled By: Yangqing
fbshipit-source-id: 4db29c28589efda1b6a3f5f88752e3984260a0f2
Summary:
Based on discussion with Misha, we're going to go with code generation for all possible variants:
* AVX2/AVX512 (eventually)
* embedding type: float16, float32
* index type: int32, int64
* reducer: sum, weighted sum, mean (with scaling by lengths)
* block size: 32, 64, 128
From some simple testing, full-loop fusion with prefetching (as opposed to TypedAxpy) gives at least a 1.5x performance win, so it is justified.
This just adds perfkernels scaffolding for the embedding lookup subfunction.
I haven't actually moved the current implementation, because it's more work to refactor the current macros/templates; it's easier and more extensible to do codegen.
The scaffolding is a bit ugly: because we don't want to pass templates across translation units, it requires explicit type names in function names. Better suggestions are welcome.
msmelyan - you'd pretty much need to generate appropriate embedding_lookup_avx2.cc
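A hedged sketch of that scaffolding (symbol names and the parameter list are illustrative, not the exact generated ones): each concrete (index type, input type, output type) combination gets its own non-template symbol per ISA, defined in its own generated translation unit, and the generic entry point dispatches at runtime.
```cpp
#include <cstdint>

// Defined in a base .cc compiled without special flags.
void EmbeddingLookup_int32_t_float_float__base(
    int64_t block_size, int64_t output_size, int64_t index_size,
    int64_t data_size, const float* input, const int32_t* indices,
    const int* lengths, const float* weights, bool normalize_by_lengths,
    float* out);

// Defined in a generated embedding_lookup_avx2.cc compiled with -mavx2 -mfma.
void EmbeddingLookup_int32_t_float_float__avx2_fma(
    int64_t block_size, int64_t output_size, int64_t index_size,
    int64_t data_size, const float* input, const int32_t* indices,
    const int* lengths, const float* weights, bool normalize_by_lengths,
    float* out);

// Generic entry point: picks the best available variant at runtime.
void EmbeddingLookup_int32_t_float_float(
    int64_t block_size, int64_t output_size, int64_t index_size,
    int64_t data_size, const float* input, const int32_t* indices,
    const int* lengths, const float* weights, bool normalize_by_lengths,
    float* out) {
#if defined(__GNUC__) || defined(__clang__)
  if (__builtin_cpu_supports("avx2")) {
    EmbeddingLookup_int32_t_float_float__avx2_fma(
        block_size, output_size, index_size, data_size, input, indices,
        lengths, weights, normalize_by_lengths, out);
    return;
  }
#endif
  EmbeddingLookup_int32_t_float_float__base(
      block_size, output_size, index_size, data_size, input, indices,
      lengths, weights, normalize_by_lengths, out);
}
```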
Reviewed By: Yangqing
Differential Revision: D5505887
fbshipit-source-id: ece489d4fd36e7ddbe71efb890f48ab38acaeaec
Summary:
This adds an example of a vectorized typed axpy implementation under
perfkernels.
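A hedged sketch of what such a vectorized typed axpy can look like for fp16 inputs accumulated into fp32 (illustrative name and signature; assumes a build with -mavx2 -mfma -mf16c).
```cpp
#include <cstdint>
#include <immintrin.h>

// Typed axpy: y += a * x, where x is IEEE half (stored as uint16_t bits)
// and y is float.
void typed_axpy_halffloat__avx2_fma(int64_t n, float a,
                                    const uint16_t* x, float* y) {
  __m256 va = _mm256_set1_ps(a);
  int64_t i = 0;
  for (; i + 8 <= n; i += 8) {
    // Load 8 fp16 values, widen to fp32, then fused multiply-add into y.
    __m128i xh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    __m256 xf = _mm256_cvtph_ps(xh);
    __m256 yf = _mm256_loadu_ps(y + i);
    _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, xf, yf));
  }
  for (; i < n; ++i) {
    y[i] += a * _cvtsh_ss(x[i]);  // scalar tail
  }
}
```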
Reviewed By: dzhulgakov
Differential Revision: D5479258
fbshipit-source-id: 469e6c8aaf2c12cdf0025bc867eb9d4cab84184f
Summary:
(1) Wrote up length reducer operators from the original dispatcher
implementation under segment_reduction_op.cc. Note that this does not
change the fp16 version now.
(2) created subfolder perfkernels for potential different backends, with
scaffolding done.
(3) provided the vanilla fp16 implementation, so that the default
implementation now supports fp16 (very slowly). This sets up the
fp16 benchmarking capability after D5477844.
The next step is to actually implement the faster versions. The goal of this diff
is mainly to make it easier for Misha to plug in his custom implementations.
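For illustration, the vanilla (slow) fp16 path boils down to widening each half-precision value to float before reducing; one self-contained way to do that conversion is sketched below (a hedged sketch, not necessarily the conversion used in this diff).
```cpp
#include <cstdint>
#include <cstring>

// Convert one IEEE half (given as its raw 16-bit pattern) to float.
// Handles zero, subnormal, normal, inf, and NaN inputs.
float half_to_float(uint16_t h) {
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t mant = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0) {
    if (mant == 0) {
      bits = sign;  // signed zero
    } else {
      // Subnormal half: renormalize so the float gets an implicit leading 1.
      uint32_t e = 0;
      while ((mant & 0x400u) == 0) { mant <<= 1; ++e; }
      bits = sign | ((113u - e) << 23) | ((mant & 0x3FFu) << 13);
    }
  } else if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (mant << 13);  // inf or NaN (payload preserved)
  } else {
    bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
  }
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}
```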
Reviewed By: dzhulgakov
Differential Revision: D5479056
fbshipit-source-id: bba30dc0d892b8e2cdfc825034fdfb7bd22a1726