This PR enables all PIE rules in ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
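For illustration, a small hypothetical snippet (the `Color` enum and the loop are made-up examples, not code from this PR) showing what each newly enabled rule flags:
```python
import enum

class Color(enum.Enum):
    RED = 1
    GREEN = 1  # PIE796: value 1 is already used by RED

for _ in range(0, 10):  # PIE808: 0 is the default start; range(10) is enough
    pass
```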
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
Changes:
1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change `__`-prefixed arguments to positional-only arguments using the `/` separator in function signatures (see the sketch below).
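A hypothetical before/after sketch of items 2 and 3 (the `scale` function is invented for illustration):
```python
# Before: %-formatting and the double-underscore positional-only convention
def scale(__value, factor):
    return "scaled %s by %s" % (__value, factor)

# After: an f-string and a real positional-only parameter via the `/` separator
def scale(value, /, factor):
    return f"scaled {value} by {factor}"
```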
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758
It is generally helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support mixing int32 indices with int64 offsets, we enforce here that the two must have the same type.
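A minimal sketch of the intended usage after this change, assuming `torch.nn.EmbeddingBag` now accepts int32 indices and offsets as long as both tensors share the same dtype:
```python
import torch

eb = torch.nn.EmbeddingBag(1000, 16, mode='sum')
indices = torch.tensor([1, 2, 4, 5, 4, 3], dtype=torch.int32)
offsets = torch.tensor([0, 2, 4], dtype=torch.int32)  # same dtype as indices

out = eb(indices, offsets)     # ok: both int32 (both int64 also works)
# eb(indices, offsets.long())  # mixing int32 indices with int64 offsets is rejected
```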
Test Plan: unit tests
Reviewed By: ngimel
Differential Revision: D24470808
fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
Summary:
These can be removed with the `2to3` tool by targeting its `future` fixer specifically; the `caffe2` directory has the most redundant imports:
```
2to3 -f future -w caffe2
```
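For context, a hypothetical before/after of what the `future` fixer removes:
```python
# Before
from __future__ import absolute_import, division, print_function, unicode_literals
print("hello")

# After running `2to3 -f future -w <file>`: the __future__ import is gone
print("hello")
```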
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32683
Pull Request resolved: https://github.com/pytorch/glow/pull/4079
Similar to D17768404, we changed the 8-bit fused version of the EmbeddingBag operator to add the option to include the last offset, and we parallelized the op.
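The option concerns the offsets convention rather than a new API; purely for illustration, here is the same idea using the analogous `include_last_offset` flag on `torch.nn.functional.embedding_bag` (not the Caffe2 operator changed in this diff):
```python
import torch

weight = torch.randn(10, 4)
indices = torch.tensor([1, 2, 4, 5, 4])

# Conventional offsets: one entry per bag; the bags are [1, 2] and [4, 5, 4]
out_a = torch.nn.functional.embedding_bag(
    indices, weight, offsets=torch.tensor([0, 2]), mode='sum')

# With the last offset included, offsets also carry the total number of indices,
# so the end of the final bag does not have to be inferred from indices.numel()
out_b = torch.nn.functional.embedding_bag(
    indices, weight, offsets=torch.tensor([0, 2, 5]), mode='sum',
    include_last_offset=True)

assert torch.allclose(out_a, out_b)
```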
ghstack-source-id: 97404645
Test Plan:
To generate the AVX2 code (`embedding_lookup_fused_8bit_rowwise_idx_avx2.cc`):
```
python hp_emblookup_codegen.py --fused --use-offsets
```
To test the correctness:
```
buck test //caffe2/torch/fb/sparsenn:test -- test_embedding_bag_byte_rowwise_offsets --print-passing-details
```
Reviewed By: yinghai
Differential Revision: D19592761
fbshipit-source-id: f009d675ea3f2228f62e9f86b7ccb94700a0dfe0
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477
We would like to add intra-op parallelization support for the EmbeddingBag operator.
This should bring a speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385
Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals
import torch
import time
eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')
input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)
niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```
The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133
- After our change, using the EmbeddingLookupIdx API which takes offsets instead of lengths
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659
- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018
As for multi-core performance: because Clang doesn't work with OpenMP here, I can only measure single-core performance on SKL T6.
ghstack-source-id: 97124557
Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```
With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```
OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```
Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets" --print-passing-details
```
Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```
Differential Revision: D17768404
fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24944
As the title says, we would like to make the EmbeddingLookup APIs take offsets rather than lengths to match PyTorch's EmbeddingBag.
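A small sketch of the lengths-to-offsets relationship this implies (a hypothetical helper, not code from this diff): offsets are the exclusive prefix sum of lengths.
```python
import itertools

def lengths_to_offsets(lengths):
    # offsets[i] is the starting index of bag i
    return [0] + list(itertools.accumulate(lengths))[:-1]

print(lengths_to_offsets([2, 0, 3]))  # [0, 2, 2]
```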
ghstack-source-id: 88883902
Test Plan:
```
python hp_emblookup_codegen.py --use-offsets
```
Check the benchmark in D16990830.
Reviewed By: jspark1105
Differential Revision: D16924271
fbshipit-source-id: 7fac640c8587db59fd2304bb8e8d63c413f27cb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20390
duc0 Ngo implemented observing floating-point exceptions, but there were a couple of places where "benign" floating-point exceptions led to false positives. This diff eliminates one source of such false positives, namely using _mm256_cvtph_ps and _mm256_cvtps_ph on a partially uninitialized array in the remainder loop.
Reviewed By: hx89
Differential Revision: D15307358
fbshipit-source-id: 38f57dfdd90c70bc693292d2f9c33c7ba558e2c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15388
This is another pass to make the perfkernels code safer from illegal instruction errors.
Removed the dependency on c10/util/Logging.h.
We err on the safer side at the expense of some verbosity.
Reviewed By: dskhudia
Differential Revision: D13502902
fbshipit-source-id: 4f833115df885c5b4f8c1ca83b9badea1553f944
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15389
SparseLengthsMean was generating uninitialized data for empty inputs (lengths == 0); we should return zeros instead.
The unit tests were also not covering this special case, which is fixed by this diff.
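A hypothetical NumPy reference for the corrected behavior (names are illustrative; this is not the operator's actual implementation):
```python
import numpy as np

def sparse_lengths_mean_ref(data, indices, lengths):
    out = np.zeros((len(lengths), data.shape[1]), dtype=data.dtype)
    pos = 0
    for i, n in enumerate(lengths):
        if n > 0:                      # empty segments keep their zero row
            out[i] = data[indices[pos:pos + n]].mean(axis=0)
        pos += n
    return out

data = np.arange(12, dtype=np.float32).reshape(4, 3)
print(sparse_lengths_mean_ref(data, np.array([0, 2, 3]), np.array([2, 0, 1])))
# the middle row is all zeros because its segment has length 0
```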
Reviewed By: salexspb
Differential Revision: D13515970
fbshipit-source-id: 3c35265638f64f13f0262cee930c94f8628005da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14950
Minimize the number of headers included from _avx2.cc files to avoid accidental compilation of functions defined in header files that are reused by other translation units, which can lead to illegal instruction errors.
Reviewed By: dskhudia
Differential Revision: D13394483
fbshipit-source-id: 67149a6fb51f7f047e745bfe395cb6dd4ae7c1ae
Summary:
Updates the perfkernel codebase to implement embedding lookup for our new fused storage format, where each row in the data matrix stores the quantized values *and* the scale and bias.
msmelyan, see this as my best-effort attempt at updating the perfkernel code for the fused storage. Let me know if any of this is grossly wrong. I also don't know whether we need to update any of the prefetching operations or anything like that.
Note that we have to keep the old code around for a bit until we get rid of the old operations with separate `scale_bias` storage.
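A rough NumPy sketch of the fused row layout described above; the exact quantization scheme here is an assumption for illustration, and the real kernels are the generated C++:
```python
import numpy as np

def fuse_rows(weights):
    """Quantize each float32 row to uint8 and append its float32 scale and bias."""
    num_rows, block_size = weights.shape
    fused = np.empty((num_rows, block_size + 8), dtype=np.uint8)
    for i, row in enumerate(weights):
        lo, hi = float(row.min()), float(row.max())
        scale = (hi - lo) / 255.0 or 1.0   # avoid a zero scale for constant rows
        q = np.clip(np.round((row - lo) / scale), 0, 255).astype(np.uint8)
        fused[i, :block_size] = q
        # the 8 trailing bytes hold the scale and the bias (the row minimum)
        fused[i, block_size:] = np.array([scale, lo], dtype=np.float32).view(np.uint8)
    return fused

def dequantize_row(fused_row):
    block_size = fused_row.size - 8
    scale, bias = np.frombuffer(fused_row[block_size:].tobytes(), dtype=np.float32)
    return fused_row[:block_size].astype(np.float32) * scale + bias
```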
Reviewed By: kennyhorror
Differential Revision: D6710843
fbshipit-source-id: b485ef2389f526c5db1260cac9d4be3fc8df0979
Summary: 8 bytes is 64 bits. Fixes an out-of-range access caught by ASAN.
Reviewed By: Yangqing
Differential Revision: D6219576
fbshipit-source-id: f7c418b12fa211890abcb5aef800bd456390b73a
Summary: Before this change, the boundary checking for 8-bit ops happened only after the first access.
Reviewed By: Yangqing
Differential Revision: D6206753
fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results below compare the old code (sparse_lengths_sum_benchmark.old.par, which uses the code in lengths_reducer_rowwise_8bit_ops.h) with our new code generated by the code generator (sparse_lengths_sum_benchmark.new.par). The block size was 128 in all cases.
```
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
```
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h, as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to apply the scale and bias factors.
2. In addition to the embedding blocks, we are now also reading scale_bias. For every pair of scale and bias, we bring in an entire 64-byte cache line while only using 8 bytes. A 128-wide uint8 input block occupies only 2 cache lines, so reading a nearly entire extra cache line of useless data adds to the bandwidth wastage (see the back-of-the-envelope numbers after this list).
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
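To make the bandwidth argument concrete, a back-of-the-envelope calculation using only the numbers above (64-byte cache lines, 8 bytes of scale/bias, a 128-byte uint8 block):
```python
CACHE_LINE = 64                          # bytes
block = 128                              # uint8 embedding block: 2 cache lines
scale_bias = 8                           # one float32 scale + one float32 bias

useful = block + scale_bias              # 136 bytes actually consumed per row
fetched = 2 * CACHE_LINE + CACHE_LINE    # 2 lines of data + a full line for scale_bias
print(fetched / useful)                  # ~1.41x bytes moved vs. bytes used,
                                         # before any prefetcher overshoot
```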
To get deeper insights into what is going on, we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8, into a microbenchmark where we varied the block size while keeping the table size constant (256 MB):
```
block_size   time(uint8)   time(float16)   time(float32)
        64          0.19           0.09           0.17
       128          0.12           0.09           0.17
       256          0.70           0.09           0.14
      1024          0.50           0.06           0.10
```
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size), the time to perform the embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving a speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher running past the end of the block.
Reviewed By: kennyhorror
Differential Revision: D5870907
fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results below compare the old code (sparse_lengths_sum_benchmark.old.par, which uses the code in lengths_reducer_rowwise_8bit_ops.h) with our new code generated by the code generator (sparse_lengths_sum_benchmark.new.par). The block size was 128 in all cases.
```
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
```
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h, as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to apply the scale and bias factors.
2. In addition to the embedding blocks, we are now also reading scale_bias. For every pair of scale and bias, we bring in an entire 64-byte cache line while only using 8 bytes. A 128-wide uint8 input block occupies only 2 cache lines, so reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insights into what is going on, we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8, into a microbenchmark where we varied the block size while keeping the table size constant (256 MB):
```
block_size   time(uint8)   time(float16)   time(float32)
        64          0.19           0.09           0.17
       128          0.12           0.09           0.17
       256          0.70           0.09           0.14
      1024          0.50           0.06           0.10
```
The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size), the time to perform the embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving a speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher running past the end of the block.
Reviewed By: dzhulgakov
Differential Revision: D5824641
fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6