Summary:
Ignore mixed upper-case/lower-case style for now
Fix 'space between function and its arguments' violations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35574
Test Plan: CI
Differential Revision: D20712969
Pulled By: malfet
fbshipit-source-id: 0012d430aed916b4518599a0b535e82d15721f78
Summary:
As a followup to https://github.com/pytorch/pytorch/pull/35042 this removes python2 from setup.py and adds Python 3.8 to the list of supported versions. We're already testing this in CircleCI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35539
Differential Revision: D20709060
Pulled By: orionr
fbshipit-source-id: 5d40bc14cb885374fec370fc7c5d3cde8769039a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35073
We want to do constant propagation for quantize_per_tensor/quantize_per_channel, which produce results consumed by these ops. Since we need to make sure a node's output has no writers before constant-propagating through it, the consumers need to be pure as well.
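For context, a minimal sketch of how this interacts with the existing constant-propagation pass; the header paths and helper name are assumptions and not part of this change:

```cpp
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/passes/constant_propagation.h>

// Constant propagation only folds a node whose output has no writers and whose
// consumers are pure; once these ops are marked pure, quantize calls on
// constant inputs become foldable.
void foldConstantQuantization(std::shared_ptr<torch::jit::Graph>& graph) {
  torch::jit::ConstantPropagation(graph);
}
```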
Test Plan:
see next PR
Imported from OSS
Differential Revision: D20655310
fbshipit-source-id: 3e33662224c21b889c8121b823f8ce0b7da75eed
Summary:
So that packages are correctly marked when looking through the html
pages.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35309
Differential Revision: D20626737
Pulled By: seemethere
fbshipit-source-id: 0fad3d99f0b0086898939fde94ddbbc9861d257e
Summary:
Let's see if it makes both test branches a bit more balanced
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35540
Test Plan: CI
Differential Revision: D20704642
Pulled By: malfet
fbshipit-source-id: 4e2ab5a80adfe78620206d4eaea30207194379cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34248
This argument will no longer exist in positional form once MemoryFormat is moved into TensorOptions by codegen, so we must stop using it when we make calls from C++. This diff eliminates all direct positional calls, passing the memory format via TensorOptions instead.
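As an illustration (the helper name and the specific op are hypothetical; the diff touches many call sites), the call-site pattern changes roughly like this:

```cpp
#include <ATen/ATen.h>

at::Tensor make_channels_last_buffer(const at::Tensor& x) {
  // Before: memory format passed as a trailing positional argument, e.g.
  //   at::empty_like(x, x.options(), at::MemoryFormat::ChannelsLast);
  // After: the memory format is carried inside TensorOptions.
  return at::empty_like(x, x.options().memory_format(at::MemoryFormat::ChannelsLast));
}
```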
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20683398
Pulled By: bhosmer
fbshipit-source-id: 6928cfca67abb22fbc667ecc2af8453d93489bd6
Summary:
Since we've done the branch cut for 1.5.0 we should bump nightlies to 1.6.0
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35495
Differential Revision: D20697043
Pulled By: seemethere
fbshipit-source-id: 3646187a5e729994138bf2c68625f25f11430b3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35519
Fix include of THHalf.h to be TH/THHalf.h. Makes the include consistent with the rest of caffe2.
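Concretely, the include changes as follows:

```cpp
// Before:
//   #include <THHalf.h>
// After (consistent with the rest of caffe2):
#include <TH/THHalf.h>
```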
Test Plan: CI
Differential Revision: D20685997
fbshipit-source-id: 893b6e96e4f1a1e7306ba2e40e4e8ee738f0344f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35545
Looks like we have never printed a quantized Tensor in cpp before
(Note: this ignores all push blocking failures!)
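A minimal sketch of the code path this exercises (the values are illustrative):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor x = torch::rand({2, 2});
  torch::Tensor q =
      torch::quantize_per_tensor(x, /*scale=*/0.1, /*zero_point=*/0, torch::kQUInt8);
  std::cout << q << std::endl;  // prints a quantized tensor from C++
}
```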
Test Plan:
.
Imported from OSS
Differential Revision: D20699748
fbshipit-source-id: 9d029815c6e75f626afabf92194154efc83f5545
Summary:
Skip tests that normally finish in under a second but take 20+ min under ASAN
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35533
Test Plan: CI
Differential Revision: D20700245
Pulled By: malfet
fbshipit-source-id: 7620b12d3aba1bafb2baa9073fa27c4a0b3dd9eb
Summary:
Fixes incorrect usages of symbol annotations including:
1. Exporting or importing a function/class in an anonymous namespace.
2. Exporting or importing a function/class implementation in a header file. By removing the symbol annotations, these become local symbols; if any of them need to remain global, I can move the implementations to the source file. (See the sketch below.)
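A sketch of the two patterns being fixed; `TORCH_API` stands in for whichever export/import macro was misused, and the names are hypothetical:

```cpp
// 1. A symbol in an anonymous namespace has internal linkage, so an
//    export/import annotation on it is meaningless and is removed.
namespace {
void helper();  // was: TORCH_API void helper();
}  // namespace

// 2. A function implemented directly in a header loses its annotation and
//    becomes a local symbol, as described above.
inline int header_only_helper() { return 0; }  // was: TORCH_API inline int header_only_helper() { ... }
```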
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35364
Differential Revision: D20670031
Pulled By: ezyang
fbshipit-source-id: cd8018dee703e2424482c27fe9608e040d8105b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34555
This is sometimes necessary, such as when T=int and the step size is of
type double.
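The summary does not show the change itself; the following is only a generic illustration of the T=int / double-step mismatch it refers to:

```cpp
#include <cstdint>

// With T = int and a double step, the arithmetic happens in double and has to
// be narrowed back to T explicitly (illustrative helper, not the actual patch).
template <typename T>
T value_at(T start, double step, int64_t index) {
  return static_cast<T>(start + step * static_cast<double>(index));
}
```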
Test Plan: Imported from OSS
Differential Revision: D20687063
Pulled By: ezyang
fbshipit-source-id: 33086d4252d06e7539733a9b1b3d6774e177b6da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35244
Add the roi_align_rotated op to the lite interpreter for the detectron2go model
(Note: this ignores all push blocking failures!)
Test Plan: try to run model in https://home.fburl.com/~stzpz/text_det/fbnet_300_20/
Reviewed By: iseeyuan
Differential Revision: D20560485
fbshipit-source-id: a81f3a590b9cc5a02d4da676b3cfa52b0e0a68c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35247
Add a leading "_" when registering quantized ops for the lite interpreter. They are needed by the d2go model
(Note: this ignores all push blocking failures!)
Test Plan:
(whole stack)
buck build -c user.ndk_cxxflags='-g1' -c caffe2.expose_op_to_c10=1 //xplat/caffe2/fb/pytorch_predictor:maskrcnnAndroid#android-armv7
Reviewed By: iseeyuan
Differential Revision: D20528760
fbshipit-source-id: 5b26d075456641b02d82f15a2d19f2266001f23b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34674
Two changes to make sure the op_names dumped in export_opnames() are consistent with what is actually used in the bytecode:
* Inline graph before dumping the operator names.
* Use code of the graph (which is used in bytecode) instead of the nodes of graph.
Test Plan: Imported from OSS
Differential Revision: D20610715
Pulled By: iseeyuan
fbshipit-source-id: 53fa9c3b36f4f242b7f2b99b421f4adf20d4b1f6
Summary:
My PR https://github.com/pytorch/pytorch/pull/33020 made subgraph_utils non-deterministic by using a set instead of a vector for closed-over values. This broke a downstream glow test. We're in the process of working with glow to not rely on the subgraph input order, but in the interim make it ordered again to fix the test.
An alternative is to use a `set` instead of a vector, but I don't particularly like committing to fixed ordering for the subgraph, especially for things like if nodes and while loops where an order doesn't really have any meaning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35508
Differential Revision: D20683959
Pulled By: eellison
fbshipit-source-id: bb39b29fef2904e52b9dc42be194bb57cbea59c4
Summary:
## Motivation
This PR upgrades MKL-DNN from v0.20 to DNNL v1.2 and resolves https://github.com/pytorch/pytorch/issues/30300.
DNNL (Deep Neural Network Library) is the new brand of MKL-DNN, which improves performance, quality, and usability over the old version.
This PR focuses on the migration of all existing functionality, including minor fixes, performance improvements, and code cleanup. It serves as the cornerstone of our future efforts to accommodate new features like OpenCL support, BF16 training, and INT8 inference, and to let the PyTorch community derive more benefit from the Intel architecture.
## What's included?
Even though DNNL has many breaking API changes, we managed to absorb most of them in ideep. This PR contains minimal changes to the integration code in PyTorch. Below is a summary of the changes:
**General:**
1. Replace op-level allocator with global-registered allocator
```
// before
ideep::sum::compute<AllocForMKLDNN>(scales, {x, y}, z);
// after
ideep::sum::compute(scales, {x, y}, z);
```
The allocator is now registered at `aten/src/ATen/native/mkldnn/IDeepRegistration.cpp`. Thereafter all tensors derived from the `cpu_engine` (by default) will use the c10 allocator.
```
RegisterEngineAllocator cpu_alloc(
ideep::engine::cpu_engine(),
[](size_t size) {
return c10::GetAllocator(c10::DeviceType::CPU)->raw_allocate(size);
},
[](void* p) {
c10::GetAllocator(c10::DeviceType::CPU)->raw_deallocate(p);
}
);
```
------
2. Simplify group convolution
In convolution we had a scenario where the ideep tensor shape did not match the aten tensor shape: when `groups > 1`, DNNL expects weight tensors to be 5-d with an extra group dimension, e.g. `goihw` instead of `oihw` in the 2d conv case.
As shown below, a lot of extra checks came with this difference in shape before. Now we've completely hidden this difference in ideep, and all tensors align with PyTorch's definition, so we can safely remove these checks from both the aten and c2 integration code.
```
// aten/src/ATen/native/mkldnn/Conv.cpp
if (w.ndims() == x.ndims() + 1) {
AT_ASSERTM(
groups > 1,
"Only group _mkldnn_conv2d weights could have been reordered to 5d");
kernel_size[0] = w.get_dim(0) * w.get_dim(1);
std::copy_n(
w.get_dims().cbegin() + 2, x.ndims() - 1, kernel_size.begin() + 1);
} else {
std::copy_n(w.get_dims().cbegin(), x.ndims(), kernel_size.begin());
}
```
------
3. Enable DNNL built-in cache
Previously, we stored DNNL jitted kernels along with intermediate buffers inside ideep using an LRU cache. Now we are switching to the newly added DNNL built-in cache, and **no longer** caching buffers in order to reduce memory footprint.
This change will mainly be reflected as lower memory usage in memory profiling results. On the code side, we removed a couple of lines involving `op_key_` that depended on the ideep cache.
------
4. Use 64-bit integer to denote dimensions
We changed the type of `ideep::dims` from `vector<int32_t>` to `vector<int64_t>`. This renders ideep dims no longer compatible with the 32-bit dims used by caffe2, so we use something like `{stride_.begin(), stride_.end()}` to cast the parameter `stride_` into an int64 vector.
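For illustration, the cast pattern looks roughly like this (`stride_` stands in for the caffe2-side parameter):

```cpp
#include <cstdint>
#include <vector>

std::vector<int64_t> to_ideep_dims(const std::vector<int>& stride_) {
  // Widened element-by-element into the 64-bit dims that ideep now expects,
  // mirroring the {stride_.begin(), stride_.end()} pattern mentioned above.
  return {stride_.begin(), stride_.end()};
}
```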
**Misc changes in each commit:**
**Commit:** change build options
Some build options were slightly changed, mainly to avoid name collisions with other projects that include DNNL as a subproject. In addition, DNNL built-in cache is enabled by option `DNNL_ENABLE_PRIMITIVE_CACHE`.
Old | New
-- | --
WITH_EXAMPLE | MKLDNN_BUILD_EXAMPLES
WITH_TEST | MKLDNN_BUILD_TESTS
MKLDNN_THREADING | MKLDNN_CPU_RUNTIME
MKLDNN_USE_MKL | N/A (not use MKL anymore)
------
**Commit:** aten reintegration
- aten/src/ATen/native/mkldnn/BinaryOps.cpp
Implement binary ops using new operation `binary` provided by DNNL
- aten/src/ATen/native/mkldnn/Conv.cpp
Clean up group convolution checks
Simplify conv backward integration
- aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp
Simplify prepacking convolution weights
- test/test_mkldnn.py
Fixed an issue in the conv2d unit test: it didn't compare conv results between the mkldnn and aten implementations before; instead it compared mkldnn with mkldnn, because the default CPU path also goes through mkldnn. Now we use `torch.backends.mkldnn.flags` to fix this issue
- torch/utils/mkldnn.py
Prepack the weight tensor in the module's `__init__` to achieve significantly better performance
------
**Commit:** caffe2 reintegration
- caffe2/ideep/ideep_utils.h
Clean up unused type definitions
- caffe2/ideep/operators/adam_op.cc & caffe2/ideep/operators/momentum_sgd_op.cc
Unify tensor initialization with `ideep::tensor::init`. Obsolete `ideep::tensor::reinit`
- caffe2/ideep/operators/conv_op.cc & caffe2/ideep/operators/quantization/int8_conv_op.cc
Clean up group convolution checks
Revamp convolution API
- caffe2/ideep/operators/conv_transpose_op.cc
Clean up group convolution checks
Clean up deconv workaround code
------
**Commit:** custom allocator
- Register c10 allocator as mentioned above
## Performance
We tested inference on some common models based on user scenarios, and most performance numbers are either better than or on par with DNNL 0.20.
ratio: new / old | Latency (batch=1 4T) | Throughput (batch=64 56T)
-- | -- | --
pytorch resnet18 | 121.4% | 99.7%
pytorch resnet50 | 123.1% | 106.9%
pytorch resnext101_32x8d | 116.3% | 100.1%
pytorch resnext50_32x4d | 141.9% | 104.4%
pytorch mobilenet_v2 | 163.0% | 105.8%
caffe2 alexnet | 303.0% | 99.2%
caffe2 googlenet-v3 | 101.1% | 99.2%
caffe2 inception-v1 | 102.2% | 101.7%
caffe2 mobilenet-v1 | 356.1% | 253.7%
caffe2 resnet101 | 100.4% | 99.8%
caffe2 resnet152 | 99.8% | 99.8%
caffe2 shufflenet | 141.1% | 69.0% †
caffe2 squeezenet | 98.5% | 99.2%
caffe2 vgg16 | 136.8% | 100.6%
caffe2 googlenet-v3 int8 | 100.0% | 100.7%
caffe2 mobilenet-v1 int8 | 779.2% | 943.0%
caffe2 resnet50 int8 | 99.5% | 95.5%
_Configuration: Platform: Skylake 8180; Latency test: 4 threads, warmup 30, 500 iterations, batch size 1; Throughput test: 56 threads, warmup 30, 200 iterations, batch size 64._
† Shufflenet is one of the few models that require temp buffers during inference. The performance degradation is expected since we no longer cache any buffers in ideep. As a solution, we suggest users opt for a caching allocator like **jemalloc** as a drop-in replacement for the system allocator in such heavy workloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32422
Test Plan:
Perf results: https://our.intern.facebook.com/intern/fblearner/details/177790608?tab=Experiment%20Results
10% improvement for ResNext with avx512, neutral on avx2
More results: https://fb.quip.com/ob10AL0bCDXW#NNNACAUoHJP
Reviewed By: yinghai
Differential Revision: D20381325
Pulled By: dzhulgakov
fbshipit-source-id: 803b906fd89ed8b723c5fcab55039efe3e4bcb77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35042
Removing python2 tests and some compat code in torch.jit. Check if dependent projects and external tests have any issues after these changes.
Test Plan: waitforsandcastle
Reviewed By: suo, seemethere
Differential Revision: D18942633
fbshipit-source-id: d76cc41ff20bee147dd8d44d70563c10d8a95a35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35393
This was being created inside the lock scope, but we don't need to hold the lock for it.
ghstack-source-id: 100953426
Test Plan: CI
Differential Revision: D20632225
fbshipit-source-id: dbf6746f638b7df5fefd9bbfceaa6b1a542580e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35491
The goal of this diff is to avoid having to set the AutoNonVariableTypeMode guard in client code that uses a custom mobile build. The guard was necessary because a custom mobile build might not include variable kernels, in which case the AutoNonVariableTypeMode guard usually has to be set. It's hard to enforce this rule at all call sites, so we make this change to simplify things.
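For reference, this is roughly what client code had to do before this change when using a custom mobile build without variable kernels (the module and input names are placeholders):

```cpp
#include <torch/script.h>

torch::jit::IValue run_inference(torch::jit::script::Module& module,
                                 std::vector<torch::jit::IValue> inputs) {
  // Previously required so dispatch skipped the (absent) variable kernels;
  // after this diff the caller no longer needs to set the guard.
  at::AutoNonVariableTypeMode non_var_guard(true);
  return module.forward(std::move(inputs));
}
```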
Another goal of the diff is to not break FL where real variable kernels are
registered.
ghstack-source-id: 100944553
Test Plan:
- With stacked diff, tested lite-trainer with MnistModel:
```
buck run xplat/caffe2/fb/lite_trainer:lite_trainer \
-c pt.disable_gen_tracing=1 \
-- --model=/home/liujiakai/ptmodels/MnistModel.bc
```
- Will test with the papaya sample app.
Differential Revision: D20643627
fbshipit-source-id: 37ea937919259c183809c2b7acab0741eff84d33
Summary:
1. Removed LossClosureOptimizer, and merged Optimizer into OptimizerBase (and renamed the merged class to Optimizer); see the usage sketch after this list
2. Merged the LBFGS-specific serialize test function and the generic test_serialize_optimizer function.
3. BC-compatibility serialization test for LBFGS
4. Removed mentions of parameters_ in optimizer.cpp and de-virtualized all functions
5. Made defaults_ an optional argument in all optimizers except SGD
**TODO**: add BC-breaking notes for this PR
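A minimal sketch of the resulting C++ optimizer usage (the model, shapes, and option values are made up; this is illustrative, not taken from the tests):

```cpp
#include <torch/torch.h>

int main() {
  torch::nn::Linear model(4, 2);
  torch::optim::LBFGS optimizer(model->parameters(), torch::optim::LBFGSOptions(0.1));
  auto input = torch::randn({8, 4});

  // After the merge, LBFGS shares the common Optimizer interface and its step()
  // takes a loss closure.
  optimizer.step([&]() -> torch::Tensor {
    optimizer.zero_grad();
    auto loss = model(input).sum();
    loss.backward();
    return loss;
  });
}
```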
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34957
Test Plan: Imported from GitHub, without a `Test Plan:` line.
Differential Revision: D20678162
Pulled By: yf225
fbshipit-source-id: 74e062e42d86dc118f0fbaddd794e438b2eaf35a
Summary:
Desugar prim::shape to aten::size so that passes don't need to reason about both ops. Serialized models still resolve to `prim::shape` so this doesn't break BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34286
Differential Revision: D20316818
Pulled By: eellison
fbshipit-source-id: d1585687212843f51e9396e07c108f5c08017818
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35433
Make RRef TorchScript API the same as RRef Python API.
Differential Revision: D7923050
fbshipit-source-id: 62589a429bcaa834b55db6ae8cfb10c0a2ee01ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35430
This fixes and adds tests for several commonly used operators.
There are some formatting differences due to running clang-format on one of the files.
Test Plan: buck test //caffe2/caffe2/fb/operators:hypothesis_test //caffe2/caffe2/python/operator_test:utility_ops_test //caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: yyetim
Differential Revision: D20657405
fbshipit-source-id: 51d86d0834003b8ac8d6acb5149ae13d7bbfc6ab
Summary:
Looks like there is a bug in the CUDA device linker: kernels that use `thrust::sort_by_key` cannot be linked with other kernels.
Solve the problem by splitting 5 thrust-heavy .cu files into a `__torch_cuda_sp` library, which is statically linked into `torch_cuda`.
For the default compilation workflow this should not make any difference.
Test Plan: Compile with `-DCUDA_SEPARABLE_COMPILATION=YES` and observe the library size difference: 310MB before, 173MB after, when compiled for sm_75
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34863
Differential Revision: D20683972
Pulled By: malfet
fbshipit-source-id: bc1492aa9d1d2d21c48e8764a8a7b403feaec5da