Summary:
fix Semmle warning: Comparison of narrow type with wide type in loop condition
For example, consider the following piece of code:
for (int i=0; i<array.size(); ++i) {}
The problem is that the return type of `array.size()` is `size_t`, which can be wider than `int` depending on the implementation. If the array size exceeds the range of `int`, `i` overflows and the loop never terminates.
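A minimal sketch of the usual fix for this class of warning (the exact change in this PR may differ): give the loop variable a type as wide as `size_t` so the comparison no longer mixes widths:
```
#include <cstddef>
#include <vector>

void process(const std::vector<int>& array) {
  // size_t matches the return type of std::vector::size(), so i can
  // always reach array.size() without overflowing.
  for (size_t i = 0; i < array.size(); ++i) {
    // ... use array[i] ...
  }
}
```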
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53951
Reviewed By: zou3519
Differential Revision: D27181495
Pulled By: malfet
fbshipit-source-id: 0612c5cedcdc656c193085e7fbb87dd163f20688
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758
It is generally helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since supporting mixed combinations such as int32 indices with int64 offsets is unlikely to be useful, we enforce that indices and offsets have the same type.
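For illustration, a minimal libtorch sketch of the resulting contract (hypothetical usage, not taken from this diff):
```
#include <torch/torch.h>

int main() {
  torch::nn::EmbeddingBag eb(
      torch::nn::EmbeddingBagOptions(1000, 16).mode(torch::kSum));
  // int32 indices with int32 offsets: both share the same dtype.
  auto indices = torch::randint(0, 1000, {50}, torch::kInt);
  auto offsets = torch::arange(0, 50, 10).to(torch::kInt);
  auto out = eb(indices, offsets);  // 5 bags of 10 indices -> out is [5, 16]
  // Mixing int32 indices with int64 offsets would be rejected.
}
```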
Test Plan: unit tests
Reviewed By: ngimel
Differential Revision: D24470808
fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477
We would like to add intra-op parallelization support for the EmbeddingBag operator.
This should bring a speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385
Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals
import torch
import time
eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')
input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)
niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```
The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133
- After our change (using the EmbeddingLookupIdx API, which takes offsets instead of lengths)
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659
- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018
As for multi-core performance: because Clang doesn't work with OpenMP, I can only measure single-core performance on SKL T6.
ghstack-source-id: 97124557
Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```
With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```
OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```
Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets" --print-passing-details
```
Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```
Differential Revision: D17768404
fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27635
PyTorch uses `offsets` instead of `lengths` for embedding table lookup. This adds `offsets` support to the fused quantized version.
AVX2 version is generated with
```
python caffe2/caffe2/perfkernels/hp_emblookup_codegen.py --fused --use-offsets
```
Test Plan:
```
buck test caffe2/torch/fb/sparsenn:test
```
Reviewed By: jianyuh
Differential Revision: D17826873
fbshipit-source-id: 23c4a96d92521deaebc02b688ad735d76a4476df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25670
This is part of the effort to get rid of protobuf dependency for
libtorch mobile build.
embedding_lookup_idx.cc is used by ATen/EmbeddingBag.cpp. It indirectly
includes caffe2.pb.h but doesn't really need it. Clean up the headers to
unblock the no-protobuf mobile build.
The broader problem is that many common headers in pytorch/caffe2 directly
or indirectly include caffe2.pb.h. After landing the stack of changes that
removes protobuf from the OSS libtorch mobile build, this will constrain
how ATen and other parts of pytorch use caffe2 components: it will break
OSS mobile CI if a PR introduces a dependency on a caffe2 file that
indirectly includes caffe2.pb.h. We will need to tease out caffe2.pb.h
dependencies as in this diff, or do a refactor to replace the
protobuf-generated types.
Chatted with gchanan and ezyang to confirm that there is no plan to
add more dependencies on caffe2 components from ATen in the near future,
so this should be fine.
Test Plan: - build locally with stacked diffs
Differential Revision: D17191913
Pulled By: ljk53
fbshipit-source-id: 1248fe6424060a8bedcf20e73942b7500ae5e815
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24944
As the title says, we would like to make the EmbeddingLookup APIs take offsets rather than lengths, to match PyTorch's EmbeddingBag.
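For illustration, offsets are the exclusive prefix sum of lengths, e.g. lengths [2, 3, 1] correspond to offsets [0, 2, 5]. A minimal sketch of the conversion (hypothetical helper, not part of this diff):
```
#include <torch/torch.h>

// lengths {2, 3, 1} -> offsets {0, 2, 5}: each offset marks where a bag
// starts in the flattened indices tensor.
torch::Tensor lengths_to_offsets(const torch::Tensor& lengths) {
  auto zero = torch::zeros({1}, lengths.options());
  return torch::cat({zero, lengths.cumsum(0).slice(/*dim=*/0, 0, -1)});
}
```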
ghstack-source-id: 88883902
Test Plan:
```
python hp_emblookup_codegen.py --use-offsets
```
Check the benchmark in D16990830.
Reviewed By: jspark1105
Differential Revision: D16924271
fbshipit-source-id: 7fac640c8587db59fd2304bb8e8d63c413f27cb8