Commit Graph

25206 Commits

b9adbb5002 Fix/relax CMake linter rules (#35574)
Summary:
Ignore mixed upper-case/lower-case command style for now.
Fix the 'space between a function name and its arguments' violations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35574

Test Plan: CI

Differential Revision: D20712969

Pulled By: malfet

fbshipit-source-id: 0012d430aed916b4518599a0b535e82d15721f78
2020-03-27 16:52:33 -07:00
96eec95ece torch.from_numpy for complex dtypes (#35531)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35531
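
A minimal usage sketch of what this change enables (a hedged illustration; exact dtype coverage is per the PR):
```python
import numpy as np
import torch

# Complex numpy arrays can now be wrapped without a copy.
a = np.array([1 + 2j, 3 - 4j], dtype=np.complex128)
t = torch.from_numpy(a)   # shares memory with the numpy array
print(t.dtype)            # torch.complex128
```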

Differential Revision: D20693581

Pulled By: anjali411

fbshipit-source-id: d53e26b4175452fa00b287efbfceea18104c1364
2020-03-27 14:40:28 -07:00
f101949390 Remove python2 support from setup.py (#35539)
Summary:
As a follow-up to https://github.com/pytorch/pytorch/pull/35042, this removes python2 from setup.py and adds Python 3.8 to the list of supported versions. We're already testing this in CircleCI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35539

Differential Revision: D20709060

Pulled By: orionr

fbshipit-source-id: 5d40bc14cb885374fec370fc7c5d3cde8769039a
2020-03-27 14:33:11 -07:00
45c9ed825a Formatting cmake (to lowercase without space for if/elseif/else/endif) (#35521)
Summary:
Running commands:
```bash
shopt -s globstar

sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i caffe2/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i torch/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i c10/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake.in
```
We may further convert all the commands into lowercase according to the following issue: 77543bde41.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35521

Differential Revision: D20704382

Pulled By: malfet

fbshipit-source-id: 42186b9b1660c34428ab7ceb8d3f7a0ced5d2e80
2020-03-27 14:25:17 -07:00
04a3345335 [quant] Make conv2d_prepack and linear_prepack pure (#35073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35073

We want to do constant propagation for quantize_per_tensor/quantize_per_channel,
which produce results that are consumed by these ops. Since we need to
make sure the output of a node has no writer before constant-propagating through it,
the consumer needs to be pure as well.

Test Plan:
see next PR

Imported from OSS

Differential Revision: D20655310

fbshipit-source-id: 3e33662224c21b889c8121b823f8ce0b7da75eed
2020-03-27 14:19:32 -07:00
e1773f2ac0 .circleci: Change default CUDA for pip, cu101 -> cu102 (#35309)
Summary:
So that packages are correctly marked when looking through the html
pages.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35309

Differential Revision: D20626737

Pulled By: seemethere

fbshipit-source-id: 0fad3d99f0b0086898939fde94ddbbc9861d257e
2020-03-27 14:13:37 -07:00
02d6e6e55f histc: Add a note on elements outside of given bounds (#34889)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34889
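
An illustrative sketch of the behavior the added note documents (values are made up):
```python
import torch

x = torch.tensor([-1.0, 0.5, 1.5, 2.5, 10.0])
# Elements falling outside [min, max] are simply not counted in any bin.
hist = torch.histc(x, bins=3, min=0, max=3)
print(hist)  # tensor([1., 1., 1.]) -- -1.0 and 10.0 are ignored
```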

Differential Revision: D20625916

Pulled By: albanD

fbshipit-source-id: febb769f40d86bae8e1c7bb51d719b92bf4a572d
2020-03-27 14:04:51 -07:00
4529d03971 Move test_libtorch from win-test2 to win-test1 group (#35540)
Summary:
Let's see if it makes both test groups a bit more balanced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35540

Test Plan: CI

Differential Revision: D20704642

Pulled By: malfet

fbshipit-source-id: 4e2ab5a80adfe78620206d4eaea30207194379cc
2020-03-27 13:10:53 -07:00
ef511d884b Calls to _empty_affine_quantized pass MemoryFormat by TensorOptions (#34248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34248

This argument will no longer exist in positional form once MemoryFormat
is moved into TensorOptions by codegen, so we must stop using it when
we make calls from C++. This diff eliminates all direct positional
calls and passes MemoryFormat via TensorOptions instead.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20683398

Pulled By: bhosmer

fbshipit-source-id: 6928cfca67abb22fbc667ecc2af8453d93489bd6
2020-03-27 13:02:13 -07:00
05e973d673 Add WorkerInfo through TorchBind to make it an available type in TorchScript (#35447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35447

as titled

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script
```

Differential Revision: D7923053

fbshipit-source-id: 7b80e0b28aa66343249b8af328ba251314674dcc
2020-03-27 12:41:28 -07:00
835ee34e38 [ROCm] Update to ROCm 3.1.1 (#35552)
Summary:
Redux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35552

Differential Revision: D20701593

Pulled By: ezyang

fbshipit-source-id: 1946d1e8fb47d597da903bae5d355bf52a5f017f
2020-03-27 12:21:12 -07:00
ff71a4192d Bump base version to 1.6.0a0 (#35495)
Summary:
Since we've done the branch cut for 1.5.0, we should bump nightlies to 1.6.0.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35495

Differential Revision: D20697043

Pulled By: seemethere

fbshipit-source-id: 3646187a5e729994138bf2c68625f25f11430b3a
2020-03-27 12:14:49 -07:00
9e22d15f14 Enable tensorexpr cpp tests in CI. try #2 (#35454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35454

Differential Revision: D20665160

Pulled By: Krovatkin

fbshipit-source-id: e04cbe92b2ee5a3288f3c4e5c83533bfea85bf85
2020-03-27 12:09:55 -07:00
930d218fbf Increase Channels Last test coverage (#35504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35504

Test Plan: Imported from OSS

Differential Revision: D20682117

Pulled By: VitalyFedyunin

fbshipit-source-id: ddd7ef1f075ea2c5c35df7bd698974fc5c59bc40
2020-03-27 12:04:47 -07:00
3af46c90bd [caffe2] Header path in byte_order.h (#35519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35519

Fix include of THHalf.h to be TH/THHalf.h. Makes the include consistent with the rest of caffe2.

Test Plan: CI

Differential Revision: D20685997

fbshipit-source-id: 893b6e96e4f1a1e7306ba2e40e4e8ee738f0344f
2020-03-27 11:57:21 -07:00
2c300df2ac [fix] at::print for quantized Tensor (#35545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35545

Looks like we have never printed a quantized Tensor in cpp before

(Note: this ignores all push blocking failures!)

Test Plan:
.

Imported from OSS

Differential Revision: D20699748

fbshipit-source-id: 9d029815c6e75f626afabf92194154efc83f5545
2020-03-27 11:15:28 -07:00
3cc43bcbb5 Skip slow quantized tests under ASAN (#35533)
Summary:
Skip tests that normally finish in under a second but take 20+ minutes under ASAN
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35533

Test Plan: CI

Differential Revision: D20700245

Pulled By: malfet

fbshipit-source-id: 7620b12d3aba1bafb2baa9073fa27c4a0b3dd9eb
2020-03-27 10:55:14 -07:00
0c16cedafe Fix some incorrect annotations found by clang-cl (#35364)
Summary:
Fixes incorrect usages of symbol annotations including:
1. Exporting or importing a function/class in an anonymous namespace.
2. Exporting or importing a function/class implementation in a header file. However, by removing the symbol annotations, they are now local symbols. If they need to remain global, I can move the implementations to the source file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35364

Differential Revision: D20670031

Pulled By: ezyang

fbshipit-source-id: cd8018dee703e2424482c27fe9608e040d8105b8
2020-03-27 10:40:04 -07:00
b33e38ec47 Allow a higher-precision step type for Vec256::arange (#34555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34555

This is sometimes necessary, such as when T=int and the step size is of
type double.
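
A plain-Python illustration (not the C++ API) of why the step needs a wider type than T:
```python
# If the step were held in the same integral type as T, a fractional step
# would collapse to zero; keeping it as double preserves the progression.
base, step = 0, 0.5
truncated = [int(base) + int(step) * i for i in range(5)]  # step truncated to int
widened = [int(base + step * i) for i in range(5)]         # step kept as double
print(truncated)  # [0, 0, 0, 0, 0]
print(widened)    # [0, 0, 1, 1, 2]
```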

Test Plan: Imported from OSS

Differential Revision: D20687063

Pulled By: ezyang

fbshipit-source-id: 33086d4252d06e7539733a9b1b3d6774e177b6da
2020-03-27 10:22:05 -07:00
5a02930d3a Vectorize (CPU) generic types for binary bitwise operators (#34338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34338

For types not optimized for AVX2, this commit gives their bitwise
operations a boost.

Benchmark (RHEL 7.7, Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, Turbo off, Release build):

```python
import timeit
for op in ('bitwise_and', 'bitwise_or', 'bitwise_xor'):
    for dtype in ('torch.int8', 'torch.uint8'):
        for n, t in [(10_000, 200000),
                    (100_000, 20000)]:
            print(f'a.{op}_(b), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(f'a.{op}_(b)', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```

Before:

```
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.353799690001324
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.056434961999912
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.2957618809996347
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.0591609650000464
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.3113185389993305
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.0693870880022587
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.3075691039994126
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.0589785859992844
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.3036618039986934
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.0595013140009542
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.2947387999993225
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.059969027999614
```

After:

```
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.9562859639991075
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.6811799210008758
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
0.9522694869992847
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.6815469840003061
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.8609786279994296
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.5794818879985542
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
0.8534434389985108
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.5764101290005783
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.9634105910008657
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.6819724230008433
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.0901075929978106
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.816546294001455
```

Test Plan: Imported from OSS

Differential Revision: D20687081

Pulled By: ezyang

fbshipit-source-id: 59b06460430ce181fb761e45a5bdd6379611b391
2020-03-27 10:15:53 -07:00
3c02de0011 copy_ fixed on cuda so removing the workaround in test_many_promotions (#35528)
Summary:
The copy_() launch failure on CUDA for complex tensors (https://github.com/pytorch/pytorch/issues/35344) has been fixed, so this removes the workaround added in PR https://github.com/pytorch/pytorch/issues/34093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35528

Differential Revision: D20693228

Pulled By: anjali411

fbshipit-source-id: dbb6369aa5a21574a0a4fe878ca10e4ecc605f6b
2020-03-27 09:39:46 -07:00
77ad3c5aeb Revert D20683972: [pytorch][PR] Fix PyTorch separate compilation
Test Plan: revert-hammer

Differential Revision:
D20683972

Original commit changeset: bc1492aa9d1d

fbshipit-source-id: 8994cbb36877d4338b8677ac6bc807dd16efa67c
2020-03-27 09:18:48 -07:00
16394a9d3f [caffe2] early return for empty indices in SLS (#35498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35498

As title

Test Plan:
Need to run remote predictor canary

In SKL T6,

numactl -m 0 -C 3 ./sparse_lengths_sum_benchmark.par -d float -e 100000 --embedding-dim 1 --average-len 0 --batch-size 16 -i 1000000

Before this diff
    0.000302733 ms.        100%. SparseLengthsSum

After this diff
    0.000214509 ms.        100%. SparseLengthsSum

Reviewed By: jianyuh, ellie-wen

Differential Revision: D20678075

fbshipit-source-id: c0c8359036b82ffcbcc8b2a89dfb62db7f0a9c14
2020-03-27 09:10:45 -07:00
25fe7f33ce Add cmakelint to CI (#35525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35525

Differential Revision: D20696655

Pulled By: malfet

fbshipit-source-id: 1b15cd730066c8a80440b39110f7f0d51f8ebad0
2020-03-27 09:04:36 -07:00
58f5a89c9a Refactor RoIAlignOp on CPU (#34698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34698

Refactor RoIAlignOp on CPU

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:roi_align_rotated_op_test

Reviewed By: houseroad

Differential Revision: D20432434

fbshipit-source-id: 9125eb3bdc83c734222d7d4947c175e3b585afa7
2020-03-27 07:53:58 -07:00
2d023fe6a7 [7] add missing roi_align_rotated op to lite interpreter (#35244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35244

Add the roi_align_rotated op to the lite interpreter for the detectron2go model

(Note: this ignores all push blocking failures!)

Test Plan: try to run model in https://home.fburl.com/~stzpz/text_det/fbnet_300_20/

Reviewed By: iseeyuan

Differential Revision: D20560485

fbshipit-source-id: a81f3a590b9cc5a02d4da676b3cfa52b0e0a68c3
2020-03-27 07:26:02 -07:00
181da12126 Revert D20687652: [pytorch][PR] Report results from cpp unittests on Windows and Linux
Test Plan: revert-hammer

Differential Revision:
D20687652

Original commit changeset: fc370b7e2614

fbshipit-source-id: 8153815c8ed8f3d4f472caa95eda76180b038a42
2020-03-27 06:56:53 -07:00
45e1be9762 Revert D19710370: [pytorch][PR] ONNX Update training ops and training amenable export API
Test Plan: revert-hammer

Differential Revision:
D19710370

Original commit changeset: e5e79d385529

fbshipit-source-id: d0114dc561a3415869805d3fbf43b92730bbcf54
2020-03-27 06:51:05 -07:00
e5cd17cc9e [4] register quantized ops for lite interpreter (#35247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35247

Add a leading "_" to register quantized ops for the lite interpreter. They are needed by the d2go model.

(Note: this ignores all push blocking failures!)

Test Plan:
(whole stack)
buck build -c user.ndk_cxxflags='-g1' -c caffe2.expose_op_to_c10=1 //xplat/caffe2/fb/pytorch_predictor:maskrcnnAndroid#android-armv7

Reviewed By: iseeyuan

Differential Revision: D20528760

fbshipit-source-id: 5b26d075456641b02d82f15a2d19f2266001f23b
2020-03-27 02:26:03 -07:00
025a0abe5a ONNX Update training ops and training amenable export API (#32950)
Summary:
- Update Dropout and Batchnorm in opset 12 : https://github.com/onnx/onnx/pull/2568
- Update API logic for exporting training-amenable models to ONNX
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32950

Reviewed By: hl475

Differential Revision: D19710370

Pulled By: houseroad

fbshipit-source-id: e5e79d38552936966662c41d39ddf33be1ba3e35
2020-03-27 00:39:39 -07:00
ac639d927a Reland "[RPC] Use qualified name str directly in RPC torch script code path" (#35489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35489

Relanding https://github.com/pytorch/pytorch/pull/34733.

Fix is in https://github.com/pytorch/pytorch/pull/34988

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D20661748

fbshipit-source-id: d550daab8d689d0a9aa2450f3bdb7417ab79dae2
2020-03-26 23:41:51 -07:00
d2d40c45b6 Report results from cpp unittests on Windows and Linux (#35500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35500

Test Plan:
Test in production :)
Results should eventually be published to: https://circleci.com/build-insights/gh/pytorch/pytorch/master

Differential Revision: D20687652

Pulled By: malfet

fbshipit-source-id: fc370b7e261402e14b427f42038ecb2d95bad059
2020-03-26 23:00:33 -07:00
da4e68faed Make operator names consistent between export_opnames and the lite interpreter (#34674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34674

Two changes to make sure the op names dumped by export_opnames() are consistent with what is actually used in the bytecode:
* Inline the graph before dumping the operator names.
* Use the code of the graph (which is what the bytecode uses) instead of the nodes of the graph.
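
A hedged usage sketch (the module and the resulting op names are illustrative):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

m = torch.jit.script(M())
# With this change the reported names should match what the bytecode actually
# uses after inlining, e.g. something like ['aten::add', 'aten::relu'].
print(torch.jit.export_opnames(m))
```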

Test Plan: Imported from OSS

Differential Revision: D20610715

Pulled By: iseeyuan

fbshipit-source-id: 53fa9c3b36f4f242b7f2b99b421f4adf20d4b1f6
2020-03-26 22:50:59 -07:00
8c90ae11b3 [JIT] fix glow subgraph inputs ordering (#35508)
Summary:
My PR https://github.com/pytorch/pytorch/pull/33020 made subgraph_utils non-deterministic by using a set instead of a vector for closed-over values. This broke a downstream glow test. We're in the process of working with glow to not rely on the subgraph input order, but in the interim make it ordered again to fix the test.

An alternative is to use a `set` instead of a vector, but I don't particularly like committing to fixed ordering for the subgraph, especially for things like if nodes and while loops where an order doesn't really have any meaning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35508

Differential Revision: D20683959

Pulled By: eellison

fbshipit-source-id: bb39b29fef2904e52b9dc42be194bb57cbea59c4
2020-03-26 22:44:54 -07:00
bd604cb5b7 Upgrade MKL-DNN to DNNL v1.2 (#32422)
Summary:
## Motivation

This PR upgrades MKL-DNN from v0.20 to DNNL v1.2 and resolves https://github.com/pytorch/pytorch/issues/30300.

DNNL (Deep Neural Network Library) is the new brand of MKL-DNN, which improves performance, quality, and usability over the old version.

This PR focuses on the migration of all existing functionality, including minor fixes, performance improvements and code cleanup. It serves as the cornerstone of our future efforts to accommodate new features like OpenCL support, BF16 training, INT8 inference, etc. and to let the PyTorch community derive more benefits from the Intel Architecture.

<br>

## What's included?

Even though DNNL has many breaking changes to the API, we managed to absorb most of them in ideep. This PR contains minimal changes to the integration code in PyTorch. Below is a summary of the changes:

<br>

**General:**

1. Replace op-level allocator with global-registered allocator

```
// before
ideep::sum::compute<AllocForMKLDNN>(scales, {x, y}, z);

// after
ideep::sum::compute(scales, {x, y}, z);
```

The allocator is now registered at `aten/src/ATen/native/mkldnn/IDeepRegistration.cpp`. Thereafter all tensors derived from the `cpu_engine` (by default) will use the c10 allocator.

```
RegisterEngineAllocator cpu_alloc(
  ideep::engine::cpu_engine(),
  [](size_t size) {
    return c10::GetAllocator(c10::DeviceType::CPU)->raw_allocate(size);
  },
  [](void* p) {
    c10::GetAllocator(c10::DeviceType::CPU)->raw_deallocate(p);
  }
);
```
------

2. Simplify group convolution

In convolution there was a scenario where the ideep tensor shape did not match the aten tensor: when `groups > 1`, DNNL expects weight tensors to be 5-d with an extra group dimension, e.g. `goihw` instead of `oihw` in the 2d conv case.

As shown below, a lot of extra checks used to come with this difference in shape. Now we've completely hidden the difference in ideep, and all tensors align with PyTorch's definition, so we can safely remove these checks from both the aten and c2 integration code.

```
// aten/src/ATen/native/mkldnn/Conv.cpp

if (w.ndims() == x.ndims() + 1) {
  AT_ASSERTM(
      groups > 1,
      "Only group _mkldnn_conv2d weights could have been reordered to 5d");
  kernel_size[0] = w.get_dim(0) * w.get_dim(1);
  std::copy_n(
      w.get_dims().cbegin() + 2, x.ndims() - 1, kernel_size.begin() + 1);
} else {
  std::copy_n(w.get_dims().cbegin(), x.ndims(), kernel_size.begin());
}
```

------

3. Enable DNNL built-in cache

Previously, we stored DNNL jitted kernels along with intermediate buffers inside ideep using an LRU cache. Now we are switching to the newly added DNNL built-in cache, and **no longer** caching buffers in order to reduce memory footprint.

This change will mainly be reflected in lower memory usage in memory profiling results. On the code side, we removed a couple of lines involving `op_key_` that depended on the ideep cache.

------

4. Use 64-bit integer to denote dimensions

We changed the type of `ideep::dims` from `vector<int32_t>` to `vector<int64_t>`. This renders ideep dims no longer compatible with the 32-bit dims used by caffe2, so we use something like `{stride_.begin(), stride_.end()}` to cast the parameter `stride_` into an int64 vector.

<br>

**Misc changes in each commit:**

**Commit:** change build options

Some build options were slightly changed, mainly to avoid name collisions with other projects that include DNNL as a subproject. In addition, the DNNL built-in cache is enabled by the option `DNNL_ENABLE_PRIMITIVE_CACHE`.

Old | New
-- | --
WITH_EXAMPLE | MKLDNN_BUILD_EXAMPLES
WITH_TEST | MKLDNN_BUILD_TESTS
MKLDNN_THREADING | MKLDNN_CPU_RUNTIME
MKLDNN_USE_MKL | N/A (not use MKL anymore)

------

**Commit:** aten reintegration

- aten/src/ATen/native/mkldnn/BinaryOps.cpp

    Implement binary ops using new operation `binary` provided by DNNL

- aten/src/ATen/native/mkldnn/Conv.cpp

    Clean up group convolution checks
    Simplify conv backward integration

- aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp

    Simplify prepacking convolution weights

- test/test_mkldnn.py

    Fixed an issue in the conv2d unit test: it didn't actually compare conv results between the mkldnn and aten implementations. Instead, it compared mkldnn with mkldnn, since the default CPU path also goes through mkldnn. Now we use `torch.backends.mkldnn.flags` to fix this issue.

- torch/utils/mkldnn.py

    Prepack the weight tensor in module `__init__` to significantly improve performance

------

**Commit:** caffe2 reintegration

- caffe2/ideep/ideep_utils.h

    Clean up unused type definitions

- caffe2/ideep/operators/adam_op.cc & caffe2/ideep/operators/momentum_sgd_op.cc

   Unify tensor initialization with `ideep::tensor::init`. Obsolete `ideep::tensor::reinit`

- caffe2/ideep/operators/conv_op.cc & caffe2/ideep/operators/quantization/int8_conv_op.cc

    Clean up group convolution checks
    Revamp convolution API

- caffe2/ideep/operators/conv_transpose_op.cc

    Clean up group convolution checks
    Clean up deconv workaround code

------

**Commit:** custom allocator

- Register c10 allocator as mentioned above

<br><br>

## Performance

We tested inference on some common models based on user scenarios, and most performance numbers are either better than or on par with those of DNNL 0.20.

ratio: new / old | Latency (batch=1 4T) | Throughput (batch=64 56T)
-- | -- | --
pytorch resnet18 | 121.4% | 99.7%
pytorch resnet50 | 123.1% | 106.9%
pytorch resnext101_32x8d | 116.3% | 100.1%
pytorch resnext50_32x4d | 141.9% | 104.4%
pytorch mobilenet_v2 | 163.0% | 105.8%
caffe2 alexnet | 303.0% | 99.2%
caffe2 googlenet-v3 | 101.1% | 99.2%
caffe2 inception-v1 | 102.2% | 101.7%
caffe2 mobilenet-v1 | 356.1% | 253.7%
caffe2 resnet101 | 100.4% | 99.8%
caffe2 resnet152 | 99.8% | 99.8%
caffe2 shufflenet | 141.1% | 69.0% †
caffe2 squeezenet | 98.5% | 99.2%
caffe2 vgg16 | 136.8% | 100.6%
caffe2 googlenet-v3 int8 | 100.0% | 100.7%
caffe2 mobilenet-v1 int8 | 779.2% | 943.0%
caffe2 resnet50 int8 | 99.5% | 95.5%

_Configuration:
Platform: Skylake 8180
Latency Test: 4 threads, warmup 30, iteration 500, batch size 1
Throughput Test: 56 threads, warmup 30, iteration 200, batch size 64_

† Shufflenet is one of the few models that require temp buffers during inference. The performance degradation is expected since we no longer cache any buffers in ideep. As a solution, we suggest users opt for a caching allocator like **jemalloc** as a drop-in replacement for the system allocator in such heavy workloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32422

Test Plan:
Perf results: https://our.intern.facebook.com/intern/fblearner/details/177790608?tab=Experiment%20Results

10% improvement for ResNext with avx512, neutral on avx2

More results: https://fb.quip.com/ob10AL0bCDXW#NNNACAUoHJP

Reviewed By: yinghai

Differential Revision: D20381325

Pulled By: dzhulgakov

fbshipit-source-id: 803b906fd89ed8b723c5fcab55039efe3e4bcb77
2020-03-26 22:07:59 -07:00
8240db11e1 [pytorch] Remove python2 support from tests and torch.jit (#35042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35042

Removing python2 tests and some compat code in torch.jit. Check if dependent projects and external tests have any issues after these changes.

Test Plan: waitforsandcastle

Reviewed By: suo, seemethere

Differential Revision: D18942633

fbshipit-source-id: d76cc41ff20bee147dd8d44d70563c10d8a95a35
2020-03-26 21:29:51 -07:00
98362d11ff [rpc] create error string in listenLoop outside of lock (#35393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35393

The error string was being created inside the lock scope, but we don't need to
hold the lock for this.
ghstack-source-id: 100953426
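
A language-agnostic illustration of the pattern in Python (names here are hypothetical, not the actual RPC code):
```python
import threading

_lock = threading.Lock()
_last_error = None

def on_listen_loop_error(exc):
    # Build the (potentially expensive) message outside the critical section...
    msg = f"listen loop failed: {exc!r}"
    # ...and hold the lock only for the state update that actually needs it.
    global _last_error
    with _lock:
        _last_error = msg
```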

Test Plan: CI

Differential Revision: D20632225

fbshipit-source-id: dbf6746f638b7df5fefd9bbfceaa6b1a542580e2
2020-03-26 20:57:01 -07:00
77bbbf042d [JIT]Support converting str to float. (#35352)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35352
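
A minimal sketch of what this enables in TorchScript:
```python
import torch

@torch.jit.script
def parse(s: str) -> float:
    return float(s)  # str-to-float conversion now compiles

print(parse("3.14"))  # 3.14
```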

Differential Revision: D20649286

Pulled By: ailzhang

fbshipit-source-id: e9b09bddd0fe3c962a7514d45fd069cd0b4e6df1
2020-03-26 20:24:59 -07:00
00a261fddd [pytorch] add fallthrough variable kernel for C10_MOBILE (#35491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35491

The goal of this diff is to avoid having to set the AutoNonVariableTypeMode guard
in client code that uses a custom mobile build. The guard was necessary because
a custom mobile build might not include variable kernels, in which case the
AutoNonVariableTypeMode guard is usually set by the client. It's hard to enforce
this rule at all call sites, so we make this change to simplify things.
Another goal of the diff is to not break FL, where real variable kernels are
registered.
ghstack-source-id: 100944553

Test Plan:
- With stacked diff, tested lite-trainer with MnistModel:
```
buck run xplat/caffe2/fb/lite_trainer:lite_trainer \
-c pt.disable_gen_tracing=1 \
-- --model=/home/liujiakai/ptmodels/MnistModel.bc
```
- Will test with the papaya sample app.

Differential Revision: D20643627

fbshipit-source-id: 37ea937919259c183809c2b7acab0741eff84d33
2020-03-26 20:08:05 -07:00
f5383a213f Fix openmp detection with clang-cl (#35365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35365

Differential Revision: D20653049

Pulled By: ezyang

fbshipit-source-id: 193c0d956b1aea72b3daa104ef49c4bf167a165a
2020-03-26 19:59:53 -07:00
5371fdb1a0 [C++ API Parity] [Optimizers] Merged Optimizer and LossClosureOptimizer (#34957)
Summary:
1. Removed LossClosureOptimizer, merged Optimizer into OptimizerBase, and renamed the merged class to Optimizer
2. Merged the LBFGS-specific serialize test function and the generic test_serialize_optimizer function
3. Added a BC-compatibility serialization test for LBFGS
4. Removed mentions of parameters_ in optimizer.cpp and de-virtualized all functions
5. Made defaults_ an optional argument in all optimizers except SGD

**TODO**: add BC-breaking notes for this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34957

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20678162

Pulled By: yf225

fbshipit-source-id: 74e062e42d86dc118f0fbaddd794e438b2eaf35a
2020-03-26 19:53:02 -07:00
e68afe3ab9 [JIT] remove prim::shape op (#34286)
Summary:
Desugar prim::shape to aten::size so that passes don't need to reason about both ops. Serialized models still resolve to `prim::shape` so this doesn't break BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34286
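
A hedged sketch for inspecting the lowering (exact graph text varies by version):
```python
import torch

@torch.jit.script
def f(x: torch.Tensor):
    return x.shape

# The printed graph should now contain aten::size rather than prim::shape.
print(f.graph)
```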

Differential Revision: D20316818

Pulled By: eellison

fbshipit-source-id: d1585687212843f51e9396e07c108f5c08017818
2020-03-26 19:29:25 -07:00
8f18cdf2b8 [Autograd Testing] Few refactors to test_autograd.py (#35443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35443

Addressing Wanchao's comments from
https://github.com/pytorch/pytorch/pull/35268.
ghstack-source-id: 100944390

Test Plan: waitforbuildbot

Differential Revision: D20662292

fbshipit-source-id: d98bf27106e858fe81e0f7755639c7da0f322913
2020-03-26 18:57:52 -07:00
5d9694250c Updating submodules
Summary:
GitHub commits:

6a867586ed
bf0ba207b5
b90f25fcfe
ea2ad0ad00
f32a0cc4a7
23826a3f97
6301dbe7a7
3332b50f59
b6cf025c4f
683abef629
099bb93f87
10214d1d1b
5b848ab61d
a6e81fb889

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: cfd395231e68b7d026fce966bcb8cddf10996770
2020-03-26 18:51:35 -07:00
9970be2fd2 Update git-pre-commit (#35511)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35511

Differential Revision: D20684849

Pulled By: suo

fbshipit-source-id: e059e15230d1a4064f45df5c7895b220c9cc20d9
2020-03-26 18:45:33 -07:00
9b4bbaab53 Add RRef.local_value() for TorchScript (#35433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35433

Make RRef TorchScript API the same as RRef Python API.

Differential Revision: D7923050

fbshipit-source-id: 62589a429bcaa834b55db6ae8cfb10c0a2ee01ff
2020-03-26 18:06:13 -07:00
d4f3bc7f8e [dt] [caffe2] add/fix shape inference for StumpFunc, SliceGradient and ResizeLike (#35430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35430

This fixes shape inference for several commonly used operators and adds tests for them.

There are some formatting differences due to running clang-format on one of the files.

Test Plan: buck test //caffe2/caffe2/fb/operators:hypothesis_test //caffe2/caffe2/python/operator_test:utility_ops_test //caffe2/caffe2/python/operator_test:concat_split_op_test

Reviewed By: yyetim

Differential Revision: D20657405

fbshipit-source-id: 51d86d0834003b8ac8d6acb5149ae13d7bbfc6ab
2020-03-26 17:50:32 -07:00
2e739f822b Fix PyTorch separate compilation (#34863)
Summary:
Looks like there is a bug in the CUDA device linker: kernels that use `thrust::sort_by_key` cannot be linked with other kernels.
Solve the problem by splitting 5 thrust-heavy .cu files into a `__torch_cuda_sp` library which is statically linked into `torch_cuda`.
For the default compilation workflow it should not make any difference.

Test Plan: Compile with `-DCUDA_SEPARABLE_COMPILATION=YES` and observe the library size difference: 310Mb before, 173Mb after when compiled for sm_75
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34863

Differential Revision: D20683972

Pulled By: malfet

fbshipit-source-id: bc1492aa9d1d2d21c48e8764a8a7b403feaec5da
2020-03-26 17:49:07 -07:00
2f6f1781af Add warning to a known autograd issue on XLA backend. (#35449)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35449

Differential Revision: D20676835

Pulled By: ailzhang

fbshipit-source-id: c351eb5650ff09654f7c2e3588dfea19dcde3856
2020-03-26 17:44:12 -07:00
8074779328 [quant][graph] Update dynamic quant tests to use new qconfig (#35451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35451

default_dynamic_qconfig now holds an activation observer
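
A quick way to inspect the updated qconfig (a hedged sketch; the exact observer classes are whatever this PR sets):
```python
import torch

qconfig = torch.quantization.default_dynamic_qconfig
# After this change, the printed qconfig should include an activation observer
# alongside the weight observer.
print(qconfig)
```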

Test Plan:
python test/test_quantize_script.py

Imported from OSS

Differential Revision: D20664585

fbshipit-source-id: 78cb6747705d230d2bbcfdae59210b4b998d0d15
2020-03-26 17:39:49 -07:00