Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14733
We often also want to use the AVX512VL instruction set.
We already included AVX512F and AVX512DQ.
Skylake also has AVX512BW and AVX512CD, which we may want to add later.
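For illustration only (a hedged sketch, not code from this diff), the 128/256-bit encodings of AVX512 masked operations are exactly what AVX512VL adds on top of AVX512F; the function name below is hypothetical.
```cpp
#include <cstdint>
#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>

// Masked 256-bit add: the 256-bit forms of AVX512 masked ops require
// AVX512VL in addition to AVX512F.
void masked_add_f32(const float* a, const float* b, float* out, int64_t n) {
  for (int64_t i = 0; i < n; i += 8) {
    // Mask off the tail so the final partial vector never reads out of bounds.
    __mmask8 m =
        (n - i >= 8) ? (__mmask8)0xFF : (__mmask8)((1u << (n - i)) - 1);
    __m256 va = _mm256_maskz_loadu_ps(m, a + i);
    __m256 vb = _mm256_maskz_loadu_ps(m, b + i);
    _mm256_mask_storeu_ps(out + i, m, _mm256_add_ps(va, vb));
  }
}
#endif
```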
Reviewed By: duc0
Differential Revision: D13317282
fbshipit-source-id: 82c8e401d82d5c3a5452fb4ccb6e5cb88d242bda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14664
This diff just adds a framework for adding AVX512 kernels.
Please be really careful about using AVX512 kernels: only use them if you're convinced they will bring a good enough *overall* speedup, because AVX512 can backfire by lowering the CPU frequency.
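As a rough sketch of what such a framework looks like (illustrative names, not this diff's actual macros), the generic entry point picks a variant at runtime and falls back when AVX512 is unavailable or not worthwhile.
```cpp
#include <cstdint>

// In a real perfkernels-style setup each variant lives in its own translation
// unit compiled with different -m flags; they are stubbed here so the sketch
// compiles standalone.
static void axpy__base(int64_t n, float a, const float* x, float* y) {
  for (int64_t i = 0; i < n; ++i) y[i] += a * x[i];
}
static void axpy__avx2(int64_t n, float a, const float* x, float* y) {
  axpy__base(n, a, x, y);  // placeholder for the AVX2 implementation
}
static void axpy__avx512(int64_t n, float a, const float* x, float* y) {
  axpy__base(n, a, x, y);  // placeholder for the AVX512 implementation
}

void axpy(int64_t n, float a, const float* x, float* y) {
#if defined(__GNUC__) || defined(__clang__)
  // Runtime CPU check; only route to AVX512 when it is expected to be a net
  // win for the whole workload (downclocking can erase local gains).
  if (__builtin_cpu_supports("avx512f")) { axpy__avx512(n, a, x, y); return; }
  if (__builtin_cpu_supports("avx2"))    { axpy__avx2(n, a, x, y);   return; }
#endif
  axpy__base(n, a, x, y);
}
```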
Reviewed By: duc0
Differential Revision: D13281944
fbshipit-source-id: 04fce8619c63f814944b727a99fbd7d35538eac6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13549
caffe2/perfkernels has a nice framework for switching at runtime between implementations optimized for different instruction sets.
This is good preparation for implementing AVX512 Adagrad kernels.
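For context, a scalar reference of the Adagrad update that such AVX512 kernels would vectorize might look like the sketch below (names and signature are illustrative, not Caffe2's exact API).
```cpp
#include <cmath>
#include <cstdint>

// Scalar reference of the Adagrad update; an AVX512 perfkernel variant would
// vectorize this loop.
void adagrad_update__base(
    int64_t n,
    const float* w,   // current weights
    const float* g,   // gradients
    const float* h,   // accumulated squared gradients
    float* nw,        // updated weights (output)
    float* nh,        // updated squared-gradient history (output)
    float epsilon,
    float lr) {       // learning rate (typically negative in this convention)
  for (int64_t i = 0; i < n; ++i) {
    float gi = g[i];
    float hi = nh[i] = h[i] + gi * gi;                   // accumulate g^2
    nw[i] = w[i] + lr * gi / (std::sqrt(hi) + epsilon);  // scaled step
  }
}
```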
Reviewed By: hyuen
Differential Revision: D12882872
fbshipit-source-id: a8f0419f6a9fd4e9b864c454dad0a80db267190c
Summary:
This was used as a convenient way for us to convert c1 models. Now that conversion is more or less done, we should probably require any users who need to convert c1 models to explicitly install c1. This PR removes the explicit c1 proto (which was copied from c1) in favor of explicit installation.
Note that caffe_translator will still work properly; the only difference is that users now need to install c1 separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10380
Differential Revision: D9267981
Pulled By: Yangqing
fbshipit-source-id: a6ce5d9463e6567976da83f2d08b2c3d94d14390
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9350
Re-apply #9270
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. The fix is to isolate Eigen from headers that are included by .cu files and processed by nvcc. This was worked on with smessmer.
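A minimal sketch of the isolation pattern, with hypothetical file and function names (header and source shown together): the header included from .cu files exposes only plain declarations, while Eigen is included only in a .cc compiled by the host C++ compiler, so nvcc never sees Eigen's headers.
```cpp
// matmul_wrapper.h -- safe to include from .cu files processed by nvcc:
// no Eigen headers leak into this interface.
#pragma once
#include <cstdint>
void MatMulCPU(const float* a, const float* b, float* c,
               int64_t m, int64_t k, int64_t n);

// matmul_wrapper.cc -- compiled only by the host C++ compiler, so Eigen's
// heavily templated headers never pass through nvcc.
#include <Eigen/Core>
void MatMulCPU(const float* a, const float* b, float* c,
               int64_t m, int64_t k, int64_t n) {
  using InMat = Eigen::Map<
      const Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>>;
  using OutMat = Eigen::Map<
      Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>>;
  OutMat(c, m, n) = InMat(a, m, k) * InMat(b, k, n);
}
```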
Reviewed By: mingzhe09088
Differential Revision: D8794431
fbshipit-source-id: de656334af46c697802073f8e8d9a6aeb9ca65a7
Summary:
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. The fix is to isolate Eigen from headers that are included by .cu files and processed by nvcc. This was worked on with smessmer.
cc mingzhe09088 smessmer BIT-silence Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9270
Reviewed By: mingzhe09088
Differential Revision: D8768025
Pulled By: orionr
fbshipit-source-id: 5b34017aeb67e35a1b5938d962181ccd4cd37591
Summary:
Updates the perfkernel codebase to implement embedding lookup for our new fused storage format, where each row in the data matrix stores the quantized values *and* the scale and bias.
msmelyan, see this as my best-effort attempt at updating the perfkernel code for the fused storage. Let me know if any of this is grossly wrong. I also don't know whether we need to update any of the prefetching operations or anything like that.
Note that we have to keep the old code around for a bit until we get rid of the old operations with separate `scale_bias` storage.
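As an illustration, assuming the fused layout appends the scale and bias as two floats after each row's quantized bytes, decoding a single row could look like the sketch below; the exact layout is defined by the fused-storage operators, not by this snippet.
```cpp
#include <cstdint>
#include <cstring>

// Decodes one row of a fused 8-bit rowwise-quantized matrix, assuming the
// row width is block_size + 2 * sizeof(float): quantized bytes, then scale,
// then bias.
void decode_fused_row(const uint8_t* row, int block_size, float* out) {
  float scale, bias;
  std::memcpy(&scale, row + block_size, sizeof(float));
  std::memcpy(&bias, row + block_size + sizeof(float), sizeof(float));
  for (int j = 0; j < block_size; ++j) {
    out[j] = row[j] * scale + bias;  // dequantize: x = q * scale + bias
  }
}
```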
Reviewed By: kennyhorror
Differential Revision: D6710843
fbshipit-source-id: b485ef2389f526c5db1260cac9d4be3fc8df0979
Summary: 8 bytes is 64 bits. Fixes an out-of-range access caught by ASAN.
Reviewed By: Yangqing
Differential Revision: D6219576
fbshipit-source-id: f7c418b12fa211890abcb5aef800bd456390b73a
Summary: Previously, the boundary check happened only after the first access in the 8-bit ops.
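A hedged sketch of the fix pattern (hypothetical helper, not the actual operator code): validate the index against the table size before the first access.
```cpp
#include <cstdint>

// Check the index *before* touching (or prefetching) the row, not after.
bool gather_row(const float* data, int64_t num_rows, int64_t block_size,
                int64_t idx, float* out) {
  if (idx < 0 || idx >= num_rows) {
    return false;  // report out-of-range instead of reading past the table
  }
  const float* row = data + idx * block_size;
  for (int64_t j = 0; j < block_size; ++j) out[j] = row[j];
  return true;
}
```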
Reviewed By: Yangqing
Differential Revision: D6206753
fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for the old code (sparse_lengths_sum_benchmark.old.par), which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and for our new code, optimized via the code generator
(sparse_lengths_sum_benchmark.new.par). The block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the scale and add the bias.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire 64-byte cache line
while only using 8 bytes of it. A 128-wide uint8 input block occupies only 2 cache lines, so
reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insight into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8,
into a microbenchmark in which we varied the block size while keeping the table size constant (256MB):
block_size  time(uint8)  time(float16)  time(float32)
        64         0.19           0.09           0.17
       128         0.12           0.09           0.17
       256         0.70           0.09           0.14
      1024         0.50           0.06           0.10
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
the time to perform the embedding lookups decreases quite drastically. For a block_size of 256 and beyond, uint8 starts achieving
a speedup over float16. A longer block better amortizes the bandwidth wasted on scale_bias and on the hardware
prefetcher running past the end of the block.
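For reference, a scalar version of the 8-bit rowwise lengths-sum discussed above could look like the sketch below (argument names are illustrative); it makes the extra per-row scale_bias load, and hence the extra cache-line traffic, explicit.
```cpp
#include <cstdint>

// Scalar reference of an 8-bit rowwise lengths-sum with *separate* scale_bias
// storage: each gathered row also reads two floats from scale_bias, touching
// at least one additional cache line beyond the quantized data itself.
void sparse_lengths_sum_8bit__base(
    const uint8_t* data,      // [num_rows, block_size] quantized values
    const float* scale_bias,  // [num_rows, 2] = {scale, bias} per row
    const int64_t* indices,   // rows to gather
    const int* lengths,       // segment lengths
    int num_segments,
    int block_size,
    float* out) {             // [num_segments, block_size]
  int64_t pos = 0;
  for (int s = 0; s < num_segments; ++s) {
    float* o = out + s * block_size;
    for (int j = 0; j < block_size; ++j) o[j] = 0.f;
    for (int k = 0; k < lengths[s]; ++k) {
      int64_t idx = indices[pos++];
      float scale = scale_bias[2 * idx];
      float bias = scale_bias[2 * idx + 1];
      const uint8_t* row = data + idx * block_size;
      for (int j = 0; j < block_size; ++j) {
        o[j] += row[j] * scale + bias;  // dequantize-and-accumulate
      }
    }
  }
}
```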
Reviewed By: kennyhorror
Differential Revision: D5870907
fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for the old code (sparse_lengths_sum_benchmark.old.par), which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and for our new code, optimized via the code generator
(sparse_lengths_sum_benchmark.new.par). The block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the scale and add the bias.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire 64-byte cache line
while only using 8 bytes of it. A 128-wide uint8 input block occupies only 2 cache lines, so
reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insight into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8,
into a microbenchmark in which we varied the block size while keeping the table size constant (256MB):
block_size  time(uint8)  time(float16)  time(float32)
        64         0.19           0.09           0.17
       128         0.12           0.09           0.17
       256         0.70           0.09           0.14
      1024         0.50           0.06           0.10
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
the time to perform the embedding lookups decreases quite drastically. For a block_size of 256 and beyond, uint8 starts achieving
a speedup over float16. A longer block better amortizes the bandwidth wasted on scale_bias and on the hardware
prefetcher running past the end of the block.
Reviewed By: dzhulgakov
Differential Revision: D5824641
fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6
Summary:
Using file(WRITE) caused the file to be rewritten for every CMake
reconfigure, which was causing unnecessary full rebuilds of the project
even when no source files changed.
The new strategy has the added benefit of enforcing that the macros.h file
is always generated correctly. When the main project relies on this
header for macro definitions (instead of relying on add_definitions()),
we can be more confident that the project will build correctly when used
as a library (which is the whole point of the macros.h file).
Upsides:
* No more unnecessary rebuilds
* Higher confidence that the project will compile properly as a third-party library
Downsides:
* Developers need to add an entry to `macros.h.in` whenever they would have added a new definition with `add_definitions()`
Closes https://github.com/caffe2/caffe2/pull/1103
Differential Revision: D5680367
Pulled By: Yangqing
fbshipit-source-id: 4db29c28589efda1b6a3f5f88752e3984260a0f2
Summary:
Based on discussion with Misha, we're going to go with code generation for all possible variants:
* AVX2/AVX512 (eventually)
* embedding type: float16, float32
* index type: int32, int64
* reducer: sum, weighted sum, mean (with scaling by lengths)
* block size: 32, 64, 128
From some simple testing, full-loop fusion with prefetching (as opposed to TypedAxpy) gives at least a 1.5x performance win, so it is justified.
This just adds perfkernels scaffolding for the embedding lookup subfunction.
I haven't actually moved the current implementation, because it's more work to refactor the current macros/templates; it's easier and more extensible to do codegen.
The scaffolding is a bit ugly: because we don't want to pass templates across translation units, it requires explicit type names in function names. Better suggestions are welcome.
msmelyan - you'd pretty much need to generate appropriate embedding_lookup_avx2.cc
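A hedged sketch of that scaffolding (symbol names and the parameter list are illustrative, not the exact generated ones): each concrete (index type, input type, output type) combination gets its own non-template symbol per ISA, defined in its own generated translation unit, and the generic entry point dispatches at runtime.
```cpp
#include <cstdint>

// Defined in a base .cc compiled without special flags.
void EmbeddingLookup_int32_t_float_float__base(
    int64_t block_size, int64_t output_size, int64_t index_size,
    int64_t data_size, const float* input, const int32_t* indices,
    const int* lengths, const float* weights, bool normalize_by_lengths,
    float* out);

// Defined in a generated embedding_lookup_avx2.cc compiled with -mavx2 -mfma.
void EmbeddingLookup_int32_t_float_float__avx2_fma(
    int64_t block_size, int64_t output_size, int64_t index_size,
    int64_t data_size, const float* input, const int32_t* indices,
    const int* lengths, const float* weights, bool normalize_by_lengths,
    float* out);

// Generic entry point: picks the best available variant at runtime.
void EmbeddingLookup_int32_t_float_float(
    int64_t block_size, int64_t output_size, int64_t index_size,
    int64_t data_size, const float* input, const int32_t* indices,
    const int* lengths, const float* weights, bool normalize_by_lengths,
    float* out) {
#if defined(__GNUC__) || defined(__clang__)
  if (__builtin_cpu_supports("avx2")) {
    EmbeddingLookup_int32_t_float_float__avx2_fma(
        block_size, output_size, index_size, data_size, input, indices,
        lengths, weights, normalize_by_lengths, out);
    return;
  }
#endif
  EmbeddingLookup_int32_t_float_float__base(
      block_size, output_size, index_size, data_size, input, indices,
      lengths, weights, normalize_by_lengths, out);
}
```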
Reviewed By: Yangqing
Differential Revision: D5505887
fbshipit-source-id: ece489d4fd36e7ddbe71efb890f48ab38acaeaec
Summary:
This adds an example of a vectorized typed axpy implementation under
perfkernels.
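A hedged sketch of what such a vectorized typed axpy can look like for fp16 inputs accumulated into fp32 (illustrative name and signature; assumes a build with -mavx2 -mfma -mf16c).
```cpp
#include <cstdint>
#include <immintrin.h>

// Typed axpy: y += a * x, where x is IEEE half (stored as uint16_t bits)
// and y is float.
void typed_axpy_halffloat__avx2_fma(int64_t n, float a,
                                    const uint16_t* x, float* y) {
  __m256 va = _mm256_set1_ps(a);
  int64_t i = 0;
  for (; i + 8 <= n; i += 8) {
    // Load 8 fp16 values, widen to fp32, then fused multiply-add into y.
    __m128i xh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    __m256 xf = _mm256_cvtph_ps(xh);
    __m256 yf = _mm256_loadu_ps(y + i);
    _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, xf, yf));
  }
  for (; i < n; ++i) {
    y[i] += a * _cvtsh_ss(x[i]);  // scalar tail
  }
}
```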
Reviewed By: dzhulgakov
Differential Revision: D5479258
fbshipit-source-id: 469e6c8aaf2c12cdf0025bc867eb9d4cab84184f
Summary:
(1) Wrote up length reducer operators from the original dispatcher
implementation under segment_reduction_op.cc. Note that this does not
change the fp16 version now.
(2) created subfolder perfkernels for potential different backends, with
scaffolding done.
(3) provided the vanilla fp16 implementation, so that the default
implementation now supports fp16 (very slowly). This sets up the
fp16 benchmarking capability after D5477844.
The next step is to actually implement the faster versions. The goal of this diff
is mainly to make it easier for Misha to plug in his custom implementations.
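For illustration, the vanilla (slow) fp16 path boils down to widening each half-precision value to float before reducing; one self-contained way to do that conversion is sketched below (a hedged sketch, not necessarily the conversion used in this diff).
```cpp
#include <cstdint>
#include <cstring>

// Convert one IEEE half (given as its raw 16-bit pattern) to float.
// Handles zero, subnormal, normal, inf, and NaN inputs.
float half_to_float(uint16_t h) {
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t mant = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0) {
    if (mant == 0) {
      bits = sign;  // signed zero
    } else {
      // Subnormal half: renormalize so the float gets an implicit leading 1.
      uint32_t e = 0;
      while ((mant & 0x400u) == 0) { mant <<= 1; ++e; }
      bits = sign | ((113u - e) << 23) | ((mant & 0x3FFu) << 13);
    }
  } else if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (mant << 13);  // inf or NaN (payload preserved)
  } else {
    bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
  }
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}
```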
Reviewed By: dzhulgakov
Differential Revision: D5479056
fbshipit-source-id: bba30dc0d892b8e2cdfc825034fdfb7bd22a1726