Compare commits

5137 Commits

Author SHA1 Message Date
7c7c9c3aa6 scatter/gather - check that inputs are of the same dimensionality (#41890)
Co-authored-by: Nikita Vedeneev <nik@quansight.com>
2020-07-22 18:33:07 -07:00
a2922f589d [1.6.0] Mark torch.set_deterministic and torch.is_deterministic as experimental (#41870)
This PR:
- renames `torch.set_deterministic` to `torch._set_deterministic`
- renames `torch.is_deterministic` to `torch._is_deterministic`
- Modifies the docstrings for both to indicate that the feature is not
yet complete.

We would like to do this because this feature is experimental and the
docstrings before this PR are misleading.

This PR does not have an accompanying change in master. That is because
there still is discussion over what the eventual state of the feature
should be: https://github.com/pytorch/pytorch/issues/15359. I expect
that there will be a better plan for this once 1.7 rolls around.

Test Plan:
- wait for CI
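
A minimal sketch of the renamed toggles on the 1.6 branch (a hedged example, not taken from the PR itself):
```
import torch

torch._set_deterministic(True)    # was torch.set_deterministic before this rename
print(torch._is_deterministic())  # True
torch._set_deterministic(False)
```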
2020-07-22 18:32:47 -07:00
8acfecaecb [1.6] Add optimizer_for_mobile doc into python api root doc (#41491)
* Add optimizer_for_mobile doc into python api root doc

* Apply suggestions from code review

Remove all references to `optimization_blacklist` as it's missing in 1.6

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2020-07-22 17:37:45 -07:00
860e18a61b Update torch.set_default_dtype doc (#41263)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41263

Test Plan: Imported from OSS

Differential Revision: D22482989

Pulled By: anjali411

fbshipit-source-id: 2aadfbb84bbab66f3111970734a37ba74d817ffd
2020-07-22 14:50:15 -07:00
8f804baaa9 Doc note for complex (#41252)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41252

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D22553266

Pulled By: anjali411

fbshipit-source-id: f6dc409da048496d72b29b0976dfd3dd6645bc4d
2020-07-22 14:49:51 -07:00
a395e0903e Autograd Doc for Complex Numbers (#41012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41012

Test Plan: Imported from OSS

Differential Revision: D22476911

Pulled By: anjali411

fbshipit-source-id: 7da20cb4312a0465272bebe053520d9911475828
2020-07-22 14:40:52 -07:00
2ca55430d2 Add reference documentation for torch/library.h (#41470) (#41602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41470

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22577426

Pulled By: ezyang

fbshipit-source-id: 4bfe5806061e74181a74d161c868acb7c1ecd1e4
2020-07-22 11:10:16 -07:00
b8e77a42bd Add CUDA11 build and test (#40452) (#41543)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40452

Differential Revision: D22316007

Pulled By: malfet

fbshipit-source-id: 94f4b4ba2a46ff3d3042ba842a615f8392cdc350

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
2020-07-22 09:53:22 -07:00
4081fdd3df Revert "port masked_select from TH to ATen and optimize perf on CPU (#33269)" (#41829)
This reverts commit fe66bdb498efe912d8b9c437a14efa4295c04fdd.

This also makes a change to THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.
2020-07-22 09:52:30 -07:00
cefb9e0cd6 Update pthreadpool to pthreadpool:029c88620802e1361ccf41d1970bd5b07fd6b7bb. (#40524) (#41190)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40524

Reviewed By: ezyang

Differential Revision: D22215742

Pulled By: AshkanAliabadi

fbshipit-source-id: ef594e0901337a92b21ddd44e554da66c723eb7c
2020-07-10 09:11:32 -07:00
d9e9e0087a [v1.6] [RPC docs] Remove mention of TensorPipe's SHM and CMA backends as they're not built (#41229)
Summary:
In short, we messed up. The SHM and CMA backends of TensorPipe are Linux-specific and thus they are guarded by an #ifdef in the agent's code. Due to a mishap with CMake (due to the fact that TensorPipe has two CMake files, one for PyTorch and a "standalone" one) we were not correctly propagating some flags and these #ifdefs were always false. This means that these two backends have always been disabled and have thus never been covered by our OSS CI. It would be irresponsible to enable them now in v1.6, so instead we remove any mention of them from the docs.

Note that this is perhaps not as bad as it sounds. These two backends were providing higher performance (latency) when the two endpoints were on the same machine. However, I suspect that most RPC users will only do transfers across machines, for which SHM and CMA wouldn't have played any role.

Original PR against master: #41200 (merged as dde3d5f4a8f713ecc4649d776565b68ca75ae5c8)

Test Plan: Docs only
2020-07-10 09:02:08 -07:00
43d746305c Preserve CUDA gencode flags (#41212)
Summary:
Add `torch._C._cuda_getArchFlags()`, which returns the list of architectures `torch_cuda` was compiled for
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with
Print a warning if any of the GPUs is not compatible with any of the CUBINs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173

Differential Revision: D22459998

Pulled By: malfet

fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
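
A short usage sketch of the new introspection helpers; the printed values are illustrative and depend on the local build:
```
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_arch_list())      # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']
    print(torch.cuda.get_gencode_flags())  # e.g. '-gencode compute=compute_37,code=sm_37 ...'
```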
2020-07-09 17:34:50 -07:00
9409e03903 [ONNX][1.6] Update interpolate recompute_scale_factor default (#41117)
* Update interpolate recompute_scale_factor default

* Update upsampling.h

* Update functional.py
2020-07-09 17:24:53 -07:00
c9a1853d2f [1.6] Make IterableDataset DataLoader.__len__ warning clearer (#41185)
* make IterableDataset DataLoader.__len__ warning clearer

* typo
2020-07-09 14:07:58 -07:00
7fa9b2923b quantizer.cpp: fix cuda memory pinning (#41139) (#41194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41139

Fixes the test case in https://github.com/pytorch/pytorch/issues/41115
by using PyTorch's CUDA allocator instead of the old Caffe2 one.

Test Plan:
run the test case from the issue:
https://gist.github.com/vkuzo/6d013aa1645cb986d0d4464a931c779b

let's run CI and see what it uncovers

Imported from OSS

Reviewed By: malfet

Differential Revision: D22438787

fbshipit-source-id: 0853b0115d198a99c43e6176aef34ea951bf5c2e

Co-authored-by: Vasiliy Kuznetsov <vasiliy@fb.com>
2020-07-09 14:06:11 -07:00
40bf15a8ac Remove copy_ warnings for angle and abs for complex tensors (#41152) (#41191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41152

fixes https://github.com/pytorch/pytorch/issues/40838

Test Plan: Imported from OSS

Differential Revision: D22444357

Pulled By: anjali411

fbshipit-source-id: 2879d0cffc0a011c624eb8e00c7b64bd33522cc3

Co-authored-by: anjali411 <chourdiaanjali123@gmail.com>
2020-07-09 13:41:15 -07:00
c164fc4d7f Patch #40883 to 1.6 release. (#41033) 2020-07-09 10:25:39 -07:00
e0b7480f34 Revert "make IterableDataset DataLoader.__len__ warning clearer (#41183)"
This reverts commit 89d7f194d8ea19f36c9afb52585a00b5b7d0ffeb.
2020-07-09 08:05:24 -07:00
89d7f194d8 make IterableDataset DataLoader.__len__ warning clearer (#41183) 2020-07-09 08:00:00 -07:00
59bb44a8e8 Add a link in RPC doc page to point to PT Distributed overview (#41108) (#41156)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-09 07:49:10 -07:00
8f4d01d9f1 Disables unary op casting to output dtype (#41097) (#41160)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41047.

Some CPU kernel implementations don't call `cast_outputs()`, so when CPU temporaries were created to hold their outputs, they weren't copied back to the out parameters correctly. Instead of fixing that issue, for simplicity this PR disables the behavior. The corresponding test in test_type_promotion.py is expanded with more operations to verify that unary ops can no longer have out arguments with different dtypes than their inputs (except in special cases like torch.abs, which maps complex inputs to float outputs, and torch.deg2rad, which is secretly torch.mul).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41097

Differential Revision: D22422352

Pulled By: mruberry

fbshipit-source-id: 8e61d34ef1c9608790b35cf035302fd226fd9421

Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
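
A minimal sketch of the tightened behavior, assuming a build that includes this change (the exact error text may differ):
```
import torch

x = torch.randn(3)
out = torch.empty(3, dtype=torch.long)
try:
    torch.neg(x, out=out)  # float input with a long out: no longer silently cast
except RuntimeError as e:
    print("dtype mismatch rejected:", e)
```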
2020-07-08 22:06:48 -07:00
77ffb25925 Add guard for non-default stream in DDP's autograd engine callback (#40115) (#41151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40115

Closes https://github.com/pytorch/pytorch/issues/37790
Closes https://github.com/pytorch/pytorch/issues/37944

A user may wish to run DDP's forward + backwards step under a non-default CUDA stream, such as one entered via `with torch.cuda.stream(stream)`. In this case, the user is responsible for synchronizing events on this stream with other streams used in the program (per the documentation at https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics), but currently DDP has a bug which causes DDP under non-default streams to fail.

If a user does the following:
```
model = DDP(...)
loss = model(input).sum()
loss.backward()
grad = model.module.weight.grad()
average = dist.all_reduce(grad)
```

There is a chance that `average` and `grad` will not be equal. This is because the CUDA kernels corresponding to the  `all_reduce` call may run before `loss.backward()`'s kernels are finished. Specifically, in DDP we copy the allreduced gradients back to the model parameter gradients in an autograd engine callback, but this callback runs on the default stream. Note that this can also be fixed by the application synchronizing on the current stream, although this should not be expected, since the application is not using the current stream at all.

This PR fixes the issue by passing the current stream into DDP's callback.

Tested by adding a UT `test_DistributedDataParallel_non_default_stream` that fails without this PR
ghstack-source-id: 106481208

Differential Revision: D22073353

fbshipit-source-id: 70da9b44e5f546ff8b6d8c42022ecc846dff033e
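
A hedged sketch of the usage pattern this fixes; it assumes a process group is already initialized and that `model` and `inputs` are placeholders living on this rank's GPU:
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model)          # `model` is a placeholder module on this rank's device
s = torch.cuda.Stream()
with torch.cuda.stream(s):      # run forward + backward on a non-default stream
    loss = ddp_model(inputs).sum()
    loss.backward()             # with this fix, the grad copy-back syncs with `s`
```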
2020-07-08 21:08:17 -07:00
af9600b1f5 [Caffe2] Move in-header virtual function implementation to .cc files (#41090)
* Move OperatorSchema default inference function implementations to .cc… (#40845)

Summary:
Move OperatorSchema default inference function implementations to the .cc file.

This prevents the implementations of those functions (defined as lambdas) from being embedded as weak symbols in every shared library that includes this header.

Combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces size of `libcaffe2_module_test_dynamic.so` from 500kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845

Differential Revision: D22334779

Pulled By: malfet

fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455

* Move `OperatorBase::AddRelatedBlobInfo` implementation to .cc file (#40844)

Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes this header, along with all of its dependencies.

This was one of the reasons why the size of libcaffe2_module_test_dynamic.so was 500Kb (the AddRelatedBlobInfo implementation pulled a quarter of libprotobuf.a in with it).

Combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces size of `libcaffe2_module_test_dynamic.so` from 500kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844

Differential Revision: D22334725

Pulled By: malfet

fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
2020-07-07 21:17:11 -07:00
83262b1ba1 torch._six.PY37 should be true for Python-3.8 as well (#40868) (#41091)
Summary:
Right now it is used to check whether `math.remainder` exists, which is the case for both Python-3.7 and 3.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40868

Differential Revision: D22343454

Pulled By: malfet

fbshipit-source-id: 6b6d4869705b64c4b952309120f92c04ac7e39fd
2020-07-07 17:15:01 -07:00
f862a6ba4d Remove unused Logger in get_matching_activations (#41023) (#41087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41023

Remove Logger in get_matching_activations since it's not used.
ghstack-source-id: 107237046

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22394957

fbshipit-source-id: 7d59e0f35e9f4c304b8487460d48236ee6e5a872

Co-authored-by: Haixin Liu <haixin@fb.com>
2020-07-07 16:09:37 -07:00
f3c1ea7455 [PyTorch Numeric Suite] Remove unnecessary Logger in input arguments (#40890) (#41086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40890

Remove unnecessary Logger in input arguments and simplify the API.
ghstack-source-id: 107110487

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22345477

fbshipit-source-id: d8b4eb3d6cb3049aa3296dead8ba29bf5467bd1c

Co-authored-by: Haixin Liu <haixin@fb.com>
2020-07-07 16:09:11 -07:00
2ed3ad2891 fix autodoc for torch.distributed.launch (#40963) (#41089)
Summary:
The doc for `torch.distributed.launch` is missing since v1.2.0 (see issue https://github.com/pytorch/pytorch/issues/36386) because PR https://github.com/pytorch/pytorch/issues/22501 added some imports at the first line.
(see torch/distributed/launch.py lines 1-5 at 542ac74987)
I move it below the docstring to make the autodoc in Sphinx work normally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40963

Differential Revision: D22380816

Pulled By: mrshenli

fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f

Co-authored-by: Xin Yao <yaox12@outlook.com>
2020-07-07 14:23:31 -07:00
a857af50a4 [quant][graphmode][fix] cloning schema in insert_observers (#40624) (#40934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40624

Previously we didn't clone the schema, so the default schema was used; this was
causing issues for some models

Test Plan: Imported from OSS

Differential Revision: D22259519

fbshipit-source-id: e2a393a54cb18f55da0c7152a74ddc22079ac350
2020-07-07 13:27:36 -07:00
d0045e5520 Some fixes for graph mode quantization (#40935)
* [quant] aten::repeat work for quantized tensor (#40644)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40644

Test Plan: Imported from OSS

Differential Revision: D22268558

fbshipit-source-id: 3bc9a129bece1b547c519772ecc6b980780fb904

* [quant][graphmode][fix] remove unsupported ops in the list (#40653)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40653

(Note: this ignores all push blocking failures!)

Test Plan: Imported from OSS

Differential Revision: D22271413

fbshipit-source-id: a01611b5d90849ac673fa5a310f910c858e907a3
2020-07-07 13:26:27 -07:00
0406b69b79 [quant][graphmode][fix] Fold conv bn (#40865) (#40970)
* [quant][graphmode][fix] Fold conv bn (#40865)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40865

1. applied a filter for the module types
2. removed the assumption that conv and bn are immediate children of the parent module

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses

Imported from OSS

Differential Revision: D22338074

fbshipit-source-id: 64739a5e56c0a74249a1dbc2c8454b88ec32aa9e

* [quant][graphmode][fix] Print the node in error message (#40889)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40889

Test Plan: Imported from OSS

Differential Revision: D22348266

fbshipit-source-id: eed2ece5c94fcfaf187d6770bed4a7109f0c0b4a
2020-07-07 13:25:39 -07:00
6220cc4380 [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar + aten::repeat (#40933)
* [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar (#40596)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40596

Previously the fusion patterns for {add/mul}_scalar were inconsistent, since the op pattern
produces a non-quantized tensor while the op replacement graph produces a quantized tensor

Test Plan: Imported from OSS

Differential Revision: D22251072

fbshipit-source-id: e16eb92cf6611578cca1ed8ebde961f8d0610137

* [quant][graphmode] Support quantization for `aten::append` (#40743)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40743

`aten::append` modifies its input in place and the output is ignored; these ops are not
supported right now, so we'll need to first make `aten::append` non-inplace
by changing
```
ignored = aten::append(list, x)
```
to
```
x_list = aten::ListConstruct(x)
result = aten::add(list, x_list)
```
and then quantize the aten::add instead.

Test Plan:
TestQuantizeJitOps.test_general_shape_ops

Imported from OSS

Differential Revision: D22302151

fbshipit-source-id: 931000388e7501e9dd17bec2fad8a96b71a5efc5
2020-07-07 13:25:02 -07:00
eaf3f2fd34 Added index_put to promotelist (#41036)
* Added index_put to promotelist

* docstring

Co-authored-by: Michael Carilli <mcarilli@nvidia.com>
2020-07-07 13:00:32 -07:00
c35b4c770b Bucket of shape analysis fixes (#41044)
* [JIT] fix unfold shape analysis (#40749)

Summary:
unfold on a 0-dimensional tensor returns a 1-dim tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40749

Differential Revision: D22361481

Pulled By: eellison

fbshipit-source-id: 621597e5f97f6e39953eb86f8b85bb4142527a9f

* shape analysis fix for default dtype

ghstack-source-id: 723aa27c2685417715a0891f5ca1ae885d4c9832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40938

* fix grad thrashing of shape analysis

ghstack-source-id: dd8742b1da52d17e9d6ab6c81ff0b27520b09417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40939

Co-authored-by: Elias Ellison <eellison@fb.com>
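
The unfold fix in concrete terms: a sketch of the eager behavior that shape analysis now matches:
```
import torch

t = torch.tensor(3.0)   # 0-dim tensor
u = t.unfold(0, 1, 1)   # returns a 1-dim tensor
print(u.shape)          # torch.Size([1])
```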
2020-07-07 12:59:47 -07:00
11baccf1b5 [release/1.6] .circleci: Output binary sizes, store binaries (#41075)
We need an easy way to quickly visually grep binary sizes from builds
and then have a way to test out those binaries quickly.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
(cherry picked from commit 66813515d4dec66f319442ba967c64b87c0286cd)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-07-07 11:27:00 -07:00
f0f0cbdd4a Docstring changes for dynamic quantized classes (#40931) (#41032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40931

Fix docstrings for dynamic quantized Linear/LSTM and associated classes
ghstack-source-id: 107064446

Test Plan: Docs show up correctly

Differential Revision: D22360787

fbshipit-source-id: 8e357e081dc59ee42fd7f12ea5079ce5d0cc9df2
2020-07-06 21:37:53 -07:00
11b70b0041 [JIT] Switch executor from Simple to Legacy. (#41017)
* properly skip legacy tests regardless of the default executor (#40381)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40381

Differential Revision: D22173938

Pulled By: Krovatkin

fbshipit-source-id: 305fc4484977e828cc4cee6e053a1e1ab9f0d6c7

* [JIT] Switch executor from Simple to Legacy.

This is done for 1.6 only in order to recover performance regressions
caused by the Legacy->Simple switch that was done in 1.5. On master we
still plan to use Simple executor and fix the performance issues in 1.7
without falling back to the Legacy executor.

Co-authored-by: Nikolay Korovaiko <korovaikon@gmail.com>
2020-07-06 21:35:02 -07:00
01e9562313 [1.6 cherrypick] Fix delegating to jit.load from torch.load (#41013) 2020-07-06 16:55:00 -07:00
3f13c9a2c8 infer tensor properties based on an input tensor rather than defaults for xxx_like ctors (#40895) (#41016)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40895

Reviewed By: eellison

Differential Revision: D22358878

Pulled By: Krovatkin

fbshipit-source-id: 2db2429aa89c180d8e52a6bb1265308483da46a2
2020-07-06 16:52:59 -07:00
63a94c021a shape inference of undefined for prim::grad (#40866) (#41015)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40866

Reviewed By: pbelevich

Differential Revision: D22358988

Pulled By: Krovatkin

fbshipit-source-id: 7118d7f8d4eaf056cfb71dc0d588d38b1dfb0fc7
2020-07-06 16:51:37 -07:00
2b175ba909 update requires_grad on loop inputs correctly (master) (#40926) (#41014)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40926

Reviewed By: eellison

Differential Revision: D22359471

Pulled By: Krovatkin

fbshipit-source-id: 823e87674e2d2917f075255ec926e0485972f4e2
2020-07-06 16:30:14 -07:00
8c3f662224 Update FP16 to FP16:4dfe081cf6bcd15db339cf2680b9281b8451eeb3. (#40956) 2020-07-06 06:59:41 -07:00
0ffdd5aa1d Update cpuinfo to cpuinfo:63b254577ed77a8004a9be6ac707f3dccc4e1fd9. (#40955) 2020-07-06 06:59:30 -07:00
d53427c541 Update FXdiv to FXdiv:b408327ac2a15ec3e43352421954f5b1967701d1. (#40954) 2020-07-06 06:59:17 -07:00
b44b1d868e Update psimd to psimd:072586a71b55b7f8c584153d223e95687148a900 (#40953) 2020-07-06 06:59:01 -07:00
9184c9832e Re-apply PyTorch pthreadpool changes (#40951)
* Re-apply PyTorch pthreadpool changes

Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.

Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`

Reviewed By: xcheng16

Differential Revision: D22199952

fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5

* Enable XNNPACK ops on iOS and macOS.

Test Plan: buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform ios --framework pytorch --remote --devices D221AP-12.0.1

Reviewed By: xta0

Differential Revision: D21886736

fbshipit-source-id: ac482619dc1b41a110a3c4c79cc0339e5555edeb

* Respect user set thread count. (#40707)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40707

Test Plan: Imported from OSS

Differential Revision: D22318197

Pulled By: AshkanAliabadi

fbshipit-source-id: f11b7302a6e91d11d750df100d2a3d8d96b5d1db

* Fix and reenable threaded QNNPACK linear (#40587)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40587

Previously, this was causing divide-by-zero only in the multithreaded
empty-batch case, while calculating tiling parameters for the threads.
In my opinion, the bug here is using a value that is allowed to be zero
(batch size) for an argument that should not be zero (tile size), so I
fixed the bug by bailing out right before the call to
pthreadpool_compute_4d_tiled.

Test Plan: TestQuantizedOps.test_empty_batch

Differential Revision: D22264414

Pulled By: dreiss

fbshipit-source-id: 9446d5231ff65ef19003686f3989e62f04cf18c9

* Fix batch size zero for QNNPACK linear_dynamic (#40588)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40588

Two bugs were preventing this from working.  One was a divide by zero
when multithreading was enabled, fixed similarly to the fix for static
quantized linear in the previous commit. The other was the computation of
min and max to determine qparams. FBGEMM uses [0,0] for [min,max] of
empty input; we do the same.

Test Plan: Added a unit test.

Differential Revision: D22264415

Pulled By: dreiss

fbshipit-source-id: 6ca9cf48107dd998ef4834e5540279a8826bc754

Co-authored-by: David Reiss <dreiss@fb.com>
2020-07-06 06:58:25 -07:00
e89c4f0dec [quant] Fix fuse linear pass (#40549) (#40751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40549

Previously we didn't check whether %weight_t was produced by `aten::t`; this would fuse some `matmul`/`addmm` that is
not 2d into `aten::linear`, which is incorrect

Test Plan: Imported from OSS

Differential Revision: D22225921

fbshipit-source-id: 9723e82fdbac6d8e1a7ade22f3a9791321ab12b6
2020-07-02 10:23:22 -07:00
ea273c68f9 Inplace construct of TorchScript Module and inplace option for quantization (#40750)
* [WIP][JIT] Add ScriptModule._reconstruct (#39979)

Summary:
**Summary**
This commit adds an instance method `_reconstruct` that permits users
to reconstruct a `ScriptModule` from a given C++ `Module` instance.

**Testing**
This commit adds a unit test for `_reconstruct`.

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33912.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39979

Differential Revision: D22172323

Pulled By: SplitInfinity

fbshipit-source-id: 9aa6551c422a5a324b822a09cd8d7c660f99ca5c

* [quant][graphmode] Enable inplace option for top level API (#40414)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40414

now that `_reconstruct` is supported in RecursiveScriptModule (https://github.com/pytorch/pytorch/pull/39979),
we can support the inplace option in the quantization API

Test Plan: Imported from OSS

Differential Revision: D22178326

fbshipit-source-id: c78bc2bcf2c42b06280c12262bb31aebcadc6c32

Co-authored-by: Meghan Lele <meghanl@fb.com>
2020-07-02 10:22:45 -07:00
4dd37bfbf7 [jit] Remove unnecessary clone APIs for script::Module and RecursiveScriptModule (#40297) (#40748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40297

Test Plan: Imported from OSS

Differential Revision: D22191660

fbshipit-source-id: 4b338ca82caaca04784bffe01fdae3d180c192f4
2020-07-02 10:22:27 -07:00
2533b9da83 Fix complex printing for sci_mode=True (#40513) (#40919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513

This PR makes the following changes:
1. Complex printing now uses print formatting for its real and imaginary values, which are joined at the end.
2. Change 1 naturally fixes the printing of complex tensors in sci_mode=True

```
>>> torch.tensor(float('inf')+float('inf')*1j)
tensor(nan+infj)
>>> torch.randn(2000, dtype=torch.cfloat)
tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j,  ...,
        -1.0200-0.2302j,  0.6511-0.1889j, -0.1069+0.1702j])
>>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j])
tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j,
        1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j])
>>> torch.randn(3, dtype=torch.cfloat)
tensor([ 1.0992-0.4459j,  1.1073+0.1202j, -0.2177-0.6342j])
>>> x = torch.tensor([1e2, 1e-2])
>>> torch.set_printoptions(sci_mode=False)
>>> x
tensor([  100.0000,     0.0100])
>>> x = torch.tensor([1e2, 1e-2j])
>>> x
tensor([100.+0.0000j,   0.+0.0100j])
```

Test Plan: Imported from OSS

Differential Revision: D22309294

Pulled By: anjali411

fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8

Co-authored-by: anjali411 <chourdiaanjali123@gmail.com>
2020-07-02 09:45:35 -07:00
c5c8a85a82 If ninja is being used, force build_ext to run. (#40881)
As ninja has accurate dependency tracking, if there is nothing to do,
then we will very quickly noop.  But this is important for correctness:
if a change was made to a header that is not listed explicitly in
the distutils Extension, then distutils will come to the wrong
conclusion about whether or not recompilation is needed (but Ninja
will work it out.)

This caused https://github.com/pytorch/vision/issues/2367

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 6409595c8ac091f3863f305c123266b9d3a167ad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40837
2020-07-02 08:05:25 -07:00
b4b8f5b9d4 Release GIL during DDP construction. (#40877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40495

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

1) Rank 0 would go into DDP construction, hold GIL and wait for broadcast in
DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before the DDP
construction.
3) The RRef fetch call is done on Rank 0 and tries to acquire GIL.
4) We now have a deadlock since Rank 0 is waiting for Rank 3 to enter the
collective and Rank 3 is waiting for Rank 0 to release GIL.
ghstack-source-id: 106534442

Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot

Differential Revision: D22205180

fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a

Co-authored-by: Pritam Damania <pritam.damania@fb.com>
2020-07-01 13:36:50 -07:00
41816dc97f [1.6] Fix dictConstruct ordering and enable dict mix (#40797)
A combination of https://github.com/pytorch/pytorch/pull/39601 and
https://github.com/pytorch/pytorch/pull/40424, both of which were approved and
merged in master.
2020-07-01 09:30:16 -07:00
31d9776c04 [1.6] fix autograd doc subsubsection display issue (#40796)
Master branch PR: https://github.com/pytorch/pytorch/pull/40582
2020-07-01 09:28:25 -07:00
ddea6c552f Ports full dtype inference deprecation to 1.6 (#40799)
* ports full deprecation

* fixtures

* Fixes lint

* Trying to fix phantom lint issue

* nuclear lint option

* Paradoxical linter fix

Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
2020-07-01 09:27:27 -07:00
091537a764 [JIT][1.6] Shape analysis fixes. (#40716)
* [JIT] Update type of the unsqueeze's output in shape analysis.

* [JIT] Fix shape analysis for aten::masked_select.

The reference says that this op always returns a 1-D tensor, even if
the input and the mask are 0-D.
2020-07-01 08:41:05 -07:00
bf4d905ea1 Fix wrong MSVC version constraint for CUDA 9.2 (#40794) (#40849)
Summary:
Tested with https://github.com/pytorch/pytorch/pull/40782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40794

Differential Revision: D22318045

Pulled By: malfet

fbshipit-source-id: a737ffd7cb8a6a9efb62b84378318f4c3800ad8f
2020-07-01 08:37:40 -07:00
415e499330 Fix zip serialization for file > 2GiB for Windows (#40852) 2020-07-01 08:36:40 -07:00
eaf7dad5d6 [1.6 cherrypick] Support Pathlike for zipfile serialization (#40793) 2020-06-30 10:38:00 -07:00
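A minimal sketch of the path-like support cherry-picked above (the file name is arbitrary):
```
from pathlib import Path
import torch

p = Path("tensor.pt")
torch.save(torch.randn(2, 2), p)  # Path objects now accepted, not just str or file objects
t = torch.load(p)
```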
75a074abdc 1.6 Port: Dynamic Versioning (#40542)
Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
2020-06-30 10:18:18 -07:00
dede34eab7 [1.6 cherrypick] Doc fix for complex views
Cherry-pick of https://github.com/pytorch/pytorch/pull/40450

Test Plan: Imported from OSS
2020-06-30 09:37:02 -07:00
0c90b6da5c [1.6 cherrypick] Fix zip serialization for file > 2GiB (#40757)
* [1.6 cherrypick] Fix zip serialization for file > 2GiB

* Update test/test_serialization.py

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2020-06-30 07:10:02 -07:00
4316199832 Add examples and tests for combining static/class method with async execution (#40619) (#40688)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40619

Test Plan: Imported from OSS

Differential Revision: D22258407

Pulled By: mrshenli

fbshipit-source-id: 036d85a2affc4505efd2df197fc513dba010e359
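
A hedged sketch of the combination documented here, adapted from the pattern in the RPC docs; the worker names are placeholders and RPC initialization is elided:
```
import torch
from torch.distributed import rpc

class AsyncOps:
    @staticmethod
    @rpc.functions.async_execution
    def static_async_add(to, x, y, z):
        # returns a Future; the callback runs once the nested RPC completes
        return rpc.rpc_async(to, torch.add, args=(x, y)).then(
            lambda fut: fut.wait() + z
        )

# ret = rpc.rpc_sync("worker1", AsyncOps.static_async_add,
#                    args=("worker2", torch.ones(2), 1, 2))
```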
2020-06-29 19:34:23 -07:00
f993e5ac88 [1.6] Update TensorPipe submodule (#40634)
Upstream PR: #40614

Summary:
This update pulls in a oneliner fix, which sets the TCP_NODELAY option on the TCP sockets of the UV transport. This leads to exceptional performance gains in terms of latency, with about a 25x improvement in one simple benchmark. This thus resolves a regression that TensorPipe had compared to the ProcessGroup agent and, in fact, ends up beating it by 2x.

The benchmark I ran is this, with the two endpoints pinned to different cores of the same machine:
```
@torch.jit.script
def remote_fn(t: int):
    return t

@torch.jit.script
def local_fn():
    for _ in range(1_000_000):
        fut = rpc.rpc_async("rhs", remote_fn, (42,))
        fut.wait()
```

And the average round-trip time (one iteration) is:
- TensorPipe with SHM: 97.2 us
- TensorPipe with UV _after the fix_: 205us
- Gloo: 440us
- TensorPipe with UV _before the fix_: 5ms

Test Plan: Ran PyTorch RPC test suite
2020-06-29 19:33:32 -07:00
c5bd737f0c [JIT] script if tracing fix (#40468) (#40572)
Summary:
Currently, torchvision annotates `batched_nms` with `torch.jit.script` so the function gets compiled when it is traced and ONNX will work. Unfortunately, this means we are eagerly compiling batched_nms, which fails if torchvision isn't built with `torchvision.ops.nms`. As a result, torchvision doesn't work on torch hub right now.

`_script_if_tracing` could solve our problem here, but right now it does not correctly interact with recursive compilation. This PR fixes that bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40468

Reviewed By: jamesr66a

Differential Revision: D22195771

Pulled By: eellison

fbshipit-source-id: 83022ca0bab6d389a48a478aec03052c9282d2b7

Co-authored-by: Elias Ellison <eellison@fb.com>
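
A rough sketch of the private helper whose recursive-compilation bug this fixes; the decorated function stays plain Python in eager mode and is compiled only when called under tracing:
```
import torch

@torch.jit._script_if_tracing
def clamp_positive(x: torch.Tensor) -> torch.Tensor:
    # compiled lazily, on first call from inside torch.jit.trace
    return torch.clamp(x, min=0.0)
```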
2020-06-29 19:30:41 -07:00
fe45c2c986 Allow slicing sequential container (#40538)
- fixes #38034
- works around missing slice functionality in Sequential
  by casting to tuple and slicing that instead
- supports iterating on the resulting slice, but not call() (see the sketch below)
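
A minimal sketch of the scripted slicing this enables (module sizes are arbitrary):
```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))

    def forward(self, x):
        for layer in self.seq[1:]:  # slice and iterate; calling the slice is unsupported
            x = layer(x)
        return x

m = torch.jit.script(M())
print(m(torch.randn(2, 4)).shape)  # torch.Size([2, 4])
```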
2020-06-29 19:29:19 -07:00
a9996bb482 Fixes caffe2 loading issues on Windows (#39513) (#40487)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/27840#issuecomment-638715422.
Contains a bunch of fixes (https://github.com/pytorch/pytorch/pull/39376 + https://github.com/pytorch/pytorch/pull/39334 + https://github.com/pytorch/pytorch/pull/38302 + https://github.com/pytorch/pytorch/pull/35362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39513

Differential Revision: D22190761

Pulled By: malfet

fbshipit-source-id: b2d52f6cb16c233d16071e9c0670dfff7da2710e
(cherry picked from commit e2201e2ed8ed7bf9c6226f8c484192949d94c248)
2020-06-29 19:17:34 -07:00
bdfcbfa18c [release/1.6] .jenkins: Install torch from test channel (#40706)
We're on a test branch so we should install from the test channel

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-29 13:53:14 -07:00
ea1b0dba18 Remove constexpr for NVCC on Windows (#40676) 2020-06-29 13:48:50 -07:00
6d85b2c989 Pin XLA CI to use r1.6 release branch. (#40721) 2020-06-29 13:41:14 -07:00
44f79651a7 Tweak file_diff_from_base for release/1.6 branch (#40712) 2020-06-29 11:41:46 -07:00
8682ac147b Docs merge (#40569)
Co-authored-by: Elias Ellison <eellison@fb.com>
2020-06-26 12:24:08 -07:00
4cc605e80a (1.6) Update docs feature classifications (#40539)
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2020-06-26 12:23:02 -07:00
b0cce716f7 Add beta warning for quant docs (#40540)
Add a beta warning to match stable and master docs: https://github.com/pytorch/pytorch/blob/master/docs/source/quantization.rst
2020-06-26 12:20:06 -07:00
0dc93ac119 [v1.6.0 patch] Install method docstrings from PyRRef to RRef (#40620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461

It turned out `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable.

That is because pybind11 generates a docstring that writes `self` as the parent-class type, `rpc.PyRRef`.

As a workaround, I am pulling the docstrings of the parent class, `PyRRef`, into the subclass, `RRef`, and doing surgery on the docstring generated by pybind11.

ghstack-source-id: 106472496

Differential Revision: D7933834

fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247

Co-authored-by: Shihao Xu <shihaoxu@fb.com>
2020-06-26 12:15:28 -07:00
bb848df10b [1.6] Remove table of contents at the top of rpc.rst (#40482)
Master PR: https://github.com/pytorch/pytorch/pull/40205

Remove the table of contents created by the `.. contents:: :local: :depth: 2` since this page isn't one of the large documentation pages (https://github.com/pytorch/pytorch/issues/38010) and is simply a landing page for the Distributed RPC Framework.

Changes made in this original PR: f10fbcc820 (diff-250b9b23fd6f1a5c15aecdb72afb9d7d)
2020-06-26 08:37:49 -07:00
2dc0b84aca Skip test_mem_leak on Windows (#40498)
(cherry picked from commit 3fb6f038256a3a5ce43e857409ce4ffb807d93a5)
2020-06-25 16:45:48 -07:00
168cddf5f1 .circleci: Fix upload to backup directory
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-23 20:57:42 -07:00
bc8760b3db .circleci: Fix pip installation of awscli
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-23 19:05:48 -07:00
4269b9a8fc .circleci: Fix backup uploads
awscli was not loaded on conda builds and the backup upload did not work
since it was a recursive copy instead of just specifically copying what
we want.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-23 18:18:06 -07:00
6a421d50ab Enabling concat fast path for channels last inputs (#39448)
Summary:
Updates concat kernel for contiguous input to support channels_last contig tensors.

This was tried on squeezenet model on pixel-2 device. It improves model perf by about 25%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39448

Test Plan: test_cat_in_channels_last

Differential Revision: D22160526

Pulled By: kimishpatel

fbshipit-source-id: 6eee6e74b8a5c66167828283d16a52022a16997f
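
A sketch of an input pattern that should hit the new fast path, assuming the kernel preserves the memory format:
```
import torch

x = torch.randn(1, 3, 8, 8).contiguous(memory_format=torch.channels_last)
y = torch.randn(1, 3, 8, 8).contiguous(memory_format=torch.channels_last)
z = torch.cat([x, y], dim=1)  # both inputs channels_last contiguous
print(z.is_contiguous(memory_format=torch.channels_last))
```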
2020-06-23 13:01:59 -07:00
27982d5711 fixes to layernorm emulation (#40422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40422

fix the remaining differences to the emulation of fp16 layernorm

Test Plan: unit test of layernorm

Reviewed By: venkatacrc

Differential Revision: D22182849

fbshipit-source-id: 8a45c21418517d65d7a41663d5ad2110d6b4677a
2020-06-23 11:51:13 -07:00
b82bd654cc Increase shapes column length (#40440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40440

Shapes sometimes need more than 35 symbols

(Note: this ignores all push blocking failures!)

Test Plan:
found during testing the recipe
https://github.com/pytorch/tutorials/pull/1019

Differential Revision: D22188679

Pulled By: ilia-cher

fbshipit-source-id: efcf5d10882af7d9225897ec87debcf4abdc523f
2020-06-23 10:49:01 -07:00
d8c384544e Destroy CUDA events after profiling (#39962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39962

Adds a simple ref-counted wrapper for CUDA events and
destroys the CUDA event after the last copy is destroyed

Test Plan: CI cuda profiler tests

Differential Revision: D22027092

Pulled By: ilia-cher

fbshipit-source-id: e0810388aa60b2291eb010896e13af1fad92e472
2020-06-23 10:44:39 -07:00
a54bb4e907 Fix demangle 't' issue in profiler (#40416)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40416

Fix demangling of 't', which produces 'unsigned short'

Test Plan:
>>> import torch
>>> from torch.autograd.profiler import profile
>>>
>>> t = torch.rand(4, 5)
>>> with profile() as prof:
...     t.t()
>>> print(prof.key_averages().table())

Differential Revision: D22179508

Pulled By: ilia-cher

fbshipit-source-id: b502af2f2547317c1a6447f2225d50b2376bfc76
2020-06-23 10:37:41 -07:00
3b040c478a Make custom_fwd a no-op when not executed under autocast (#36171)
Summary:
Currently, a custom autograd function written with
```
torch.cuda.amp.custom_fwd(cast_inputs=dtype)
def forward(ctx, *args):
    ...
```
casts incoming floating-point CUDA tensors to `dtype` unconditionally, regardless of whether the function executes in an autocast-enabled region.  I think I had the wrong idea there.  Autocast-disabled regions should give the user control of input types.  Also, `custom_fwd(cast_inputs=dtype)`-decorated functions' behavior should align with native fp32list/fp16list functions.  C++-side casting wrappers have no effect when autocast is disabled, and  `custom_fwd`'s casting should behave the same way.

The present PR changes `custom_fwd` so it only casts in autocast-enabled regions (also updates custom_fwd to ignore fp64 inputs, like the C++ wrappers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36171

Differential Revision: D22179511

Pulled By: ngimel

fbshipit-source-id: 5a93d070179a43206066bce19da0a5a19ecaabbd
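
A hedged sketch of the decorator pattern whose semantics change here; with this PR the cast happens only inside autocast-enabled regions:
```
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class MyMul(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float16)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a * b

    @staticmethod
    @custom_bwd
    def backward(ctx, g):
        a, b = ctx.saved_tensors
        return g * b, g * a

# with torch.cuda.amp.autocast():
#     out = MyMul.apply(x, y)  # inputs cast to float16
# out = MyMul.apply(x, y)      # outside autocast: dtypes left untouched
```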
2020-06-23 10:23:02 -07:00
f652abc1dd [jit] Enable copy.deepcopy and copy.copy for RecursiveScriptModule (#32685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32685

att

Test Plan:
.

Imported from OSS

Differential Revision: D21220755

fbshipit-source-id: 5c71e9bb9f43032cf60563a9e67579118a8d7e33
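
A short sketch of what this enables:
```
import copy
import torch
import torch.nn as nn

m = torch.jit.script(nn.Linear(2, 2))
m2 = copy.deepcopy(m)  # independent copy of a RecursiveScriptModule
m3 = copy.copy(m)      # shallow copy
```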
2020-06-23 09:21:12 -07:00
9bf255573f quant docs: add and clean up ELU (#40377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40377

Cleans up the docstring for quantized ELU and adds it to the quantization docs.

Test Plan: * build on Mac OS and inspect

Differential Revision: D22162834

Pulled By: vkuzo

fbshipit-source-id: e548fd4dc8d67db27ed19cac4dbdf2a942586759
2020-06-23 09:02:43 -07:00
d71ec51c0e quant docs: add and clean up BatchNorm{n}d (#40346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40346

Cleans up docstrings for quantized BatchNorm and adds to quantization docs

Test Plan: * build on Mac OS and inspect

Differential Revision: D22152633

Pulled By: vkuzo

fbshipit-source-id: e0bf02194158231e0205b5b2df7f6f1ffc3c4d65
2020-06-23 09:02:41 -07:00
5e683517a7 quant docs: add and clean up InstanceNorm{n}d (#40345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40345

Fixes docstrings and adds to quantization docs for quantized InstanceNorm.

Test Plan: * build on Mac OS and inspect

Differential Revision: D22152637

Pulled By: vkuzo

fbshipit-source-id: 7a485311ead20796b7a0944827d1d04e14ec8dcd
2020-06-23 09:02:39 -07:00
6e3fdd77ca quant docs: add and clean up GroupNorm (#40343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40343

Cleans up the quantized GroupNorm docstring and adds it to quantization docs.

Test Plan: * build on Mac OS and inspect

Differential Revision: D22152635

Pulled By: vkuzo

fbshipit-source-id: 5553b841c7a5d77f1467f0c40657db9e5d730a12
2020-06-23 09:02:36 -07:00
d15fcc7e49 quant docs: add and clean up LayerNorm (#40342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40342

Cleans up the docstrings for quantized LayerNorm, and adds it to the docs.

Test Plan: * build on Mac OS and inspect

Differential Revision: D22152639

Pulled By: vkuzo

fbshipit-source-id: 38adf14b34675d1983ac4ed751938aa396e5400b
2020-06-23 09:02:34 -07:00
d27f8eaf92 quant docs: add and clean up hardtanh (#40341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40341

Cleans up the hardtanh docstring and adds it to quantization docs.

Test Plan: * build and inspect on Mac OS

Differential Revision: D22152636

Pulled By: vkuzo

fbshipit-source-id: c98e635199c8be332aa6958664ff23faad834908
2020-06-23 09:02:32 -07:00
8e74fb6a0c quant docs: add and clean up hardsigmoid (#40340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40340

Adds and simplifies quantization docs for hardsigmoid

Test Plan:
* build docs on Mac OS
* inspect

Differential Revision: D22152634

Pulled By: vkuzo

fbshipit-source-id: 18da273023fb00e5f0bc1e881b00536492c606d3
2020-06-23 09:02:29 -07:00
c4594a97ae quant docs: clean up hardswish (#40323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40323

Cleans up the naming and the function param docs for quantized hardswish.
Remove redundant docstrings and link to floating point modules instead.

Test Plan:
* build the docs on Mac OS
* verify that every link works as expected

Differential Revision: D22152638

Pulled By: vkuzo

fbshipit-source-id: fef04874ae460b449c677424a6a1c6dd47054795
2020-06-23 08:59:34 -07:00
79736ff9c2 Simplify complex case for div_cpu (#39996)
Summary:
Simplify complex case for `div_cpu`

cc: zasdfgbnm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39996

Differential Revision: D22169715

Pulled By: anjali411

fbshipit-source-id: e1822ee5c575cb8786b395c1bc7550890b38a60d
2020-06-23 08:48:10 -07:00
3e6fa778a5 Testcppextensionjit rebuild once (#40169)
Summary:
Previously:
    the `dont_wipe_extensions_build_folder` decorator controlled whether or not the build path was cleaned.
Now:
    if the cpp files or args changed, the extension is rebuilt; the build path is cleaned only before and after the test suite.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40169

Differential Revision: D22161450

Pulled By: ezyang

fbshipit-source-id: 9167c8265e13922f68cd886be900f84ffc6afb84
2020-06-23 08:43:14 -07:00
54c05fa34e Add basic GPU support to distributed autograd. (#40312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40312

As part of https://github.com/pytorch/pytorch/issues/40255, we
realized that GPU support for distributed autograd was broken as part of our
multithreaded autograd change.

To fix this in the short term for 1.6, this PR includes the following changes:

1) Long lived CPU thread in DistEngine to execute GPU->CPU continuations in the
autograd graph.
2) The long lived CPU thread has its own ready_queue and this queue is used for
all GraphTasks created by DistEngine.
3) In thread_main(), the CPU thread cannot exit once the GraphTask is done
processing because of the new CPU thread added in 1).
4) To resolve this, thread_main() now has a parameter `device_thread` instead
of `reentrant_thread`. When device_thread is True, we expect this to be a long
lived device thread that does not exit.
5) When device_thread is False, thread_main is expected to run a GraphTask and
return once done.
ghstack-source-id: 106391329

Test Plan: waitforbuildbot

Differential Revision: D22146183

fbshipit-source-id: dd146b7a95f55db75f6767889b7255e9d62d5825
2020-06-23 07:49:00 -07:00
e509c58a1c Set C++14 compatibility flag in torch_compile_options (#40399)
Summary:
Also mark warning modifiers as private options (i.e. libraries depending on `torch_cpu` do not have to be compiled with `-Wall`)
Closes https://github.com/pytorch/pytorch/issues/31283
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40399

Differential Revision: D22186206

Pulled By: malfet

fbshipit-source-id: 1ad4277b5acc5c39849a3e4efe4b93a189d26e59
2020-06-23 07:10:22 -07:00
2acee6dc93 Revert D22124313: Use Int8QuantParamsBlob to pass the scale and zeropoint params
Test Plan: revert-hammer

Differential Revision:
D22124313

Original commit changeset: 6b5c1974c0fc

fbshipit-source-id: 87a9a64c323be40db5d7d584029efa10c779dfa1
2020-06-23 05:54:44 -07:00
08ae7d3a71 [Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D22183348

fbshipit-source-id: afd4f7e8c18587c6ce1e1d6e76c8eeb9c558de15
2020-06-23 05:26:55 -07:00
1ec4337b7d Use Int8QuantParamsBlob to pass the scale and zeropoint params (#40390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40390

Change the Int8FC/Int8Quantize op interface to use Int8QuantParamsBlob as the qparam input blob format when needed.

Test Plan:
```
 buck test caffe2/caffe2/quantization/server:
```

Reviewed By: hx89

Differential Revision: D22124313

fbshipit-source-id: 6b5c1974c0fc5928f72773495f0da8d0eb9b98c9
2020-06-23 00:45:21 -07:00
78b3d5f878 [TensorPipe] Register multiplexing channel over UV (#40389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40389

The `mpt_uv` channel MultiPlexes over a Transport, namely the UV one. What this means is that it takes a tensor, chunks it into equal parts and sends each of them on a separate UV connection, each running in a separate UV loop. Thus they each have their own socket and thread. This allows them to reach bandwidths that go beyond what a simple single-threaded approach can do, which is necessary to reach the high bandwidths of some modern NICs.
ghstack-source-id: 106375511

Test Plan: Ran a few manual tests myself, for the rest relied on the PyTorch RPC tests.

Differential Revision: D22144380

fbshipit-source-id: ef555fa04c6f13a4acf3bd5f7b03d04d02460d38
2020-06-23 00:24:17 -07:00
e9efad6878 [ROCM][CI] Skip fp16 bench and 2-GPU runs (#40243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40243

rocm bench has a large backlog right now. Let's skip some tests.

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22125197

fbshipit-source-id: 330b52ce7f97af4e45c58f25bc7d57351d7c4efb
2020-06-22 21:53:55 -07:00
ba89a89376 [quant][graphmode][refactor] InsertQuantDeQuantHelper (#40384)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40384

Test Plan: Imported from OSS

Differential Revision: D22164072

fbshipit-source-id: 0ca86265cfef1afa99dd860a452f3dd76e31792a
2020-06-22 21:30:17 -07:00
6c40ec55df Revert D22165477: [pytorch][PR] [JIT] Fork/Join inline docs
Test Plan: revert-hammer

Differential Revision:
D22165477

Original commit changeset: 93132cd6987f

fbshipit-source-id: f3d5d35b6640d786ec3bada1396b5d7ad645c26d
2020-06-22 20:51:56 -07:00
7bf1dd582a Fix Cuda IPC deadlock (#40347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40347

Fixes: #39541
Fixes: #25301

Differential Revision: D22152662

Test Plan: Imported from OSS

Pulled By: VitalyFedyunin

fbshipit-source-id: 82548aa4c937e0260932244e78cb132bcb3209b3
2020-06-22 20:50:25 -07:00
18122facb9 [quant][graphmode] Add warning for debug option for add_scalar/mul_scalar (#40383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40383

The debug option is not supported for these cases, so we print a warning when it is used.

Test Plan: Imported from OSS

Differential Revision: D22164071

fbshipit-source-id: 90459530f4efdd6d255df4f015606cb0e9070cd3
2020-06-22 20:29:44 -07:00
5766da503b Device name should be a string, not bytes (#40322)
Summary:
I.e., do not accept `bytes` as a possible type for the `device` argument in
`torch.cuda._get_device_index`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40322

Differential Revision: D22176885

Pulled By: malfet

fbshipit-source-id: 2f3a46174161f1cdcf6a6ad94a31e54b18ad6186
2020-06-22 19:27:25 -07:00
0d24ed0c81 Add note to torch.save (#40394)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40394

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22168181

Pulled By: jamesr66a

fbshipit-source-id: 634104a1c18faf3b6cb0e0f49d3980d671a141f4
2020-06-22 18:41:58 -07:00
64f925eb0c [quant][graphmode] Add support for functional linear (#40331)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40331

Test Plan: Imported from OSS

Differential Revision: D22162905

fbshipit-source-id: 3e0320d5f5c267c778af8e2fe4224f8383aab2c8
2020-06-22 18:05:06 -07:00
b02c932fb6 qat eager: remove unneeded modules (#40396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40396

Removes activation and normalization modules from eager mode QAT.
These were incorrectly added, but we don't actually need them.

Test Plan:
```
python test/test_quantization.py TestQuantizationAwareTraining
```

Imported from OSS

Differential Revision: D22169768

fbshipit-source-id: b5bd753dafe92e90e226fb773eb18c6aae179703
2020-06-22 17:45:51 -07:00
d7d75e37bb Add state dict for LSTM and RNNCell and helper functions for accessing weights and bias (#40333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40333

Add state_dict support for dynamic quantized LSTM/GRU/RNNCell.

Add helper functions get_weight and get_bias for LSTM and RNNCells
ghstack-source-id: 106364749

(Note: this ignores all push blocking failures!)

Test Plan:
buck test caffe2/test:quantization -- 'test_lstm_api \(quantization\.test_quantized_module\.TestDynamicQuantizedModule\)' --print-passing-details

buck test caffe2/test:quantization -- 'test_cell_api \(quantization\.test_quantized_module\.TestDynamicQuantizedModule\)' --print-passing-details

Differential Revision: D22151020

fbshipit-source-id: 2eb54062f6c6a35ffe4dbe8e8cfbf7ede0e92ba1
2020-06-22 17:41:07 -07:00
8066fba226 [RELAND2] Change AccumulateGrad to yield .grads that match weights' memory layout (#40358)
Summary:
https://github.com/pytorch/pytorch/pull/40129 fixed the error responsible for the first revert, but exposed another error in the same test.

This PR is intended as the "master copy" for merge, and it runs on full CI.
Two other PRs (restricted to run on a small subset of CI) supporting debugging DDP failures/hangs with multiple devices per process (`test_c10d.py:DistributedDataParallelTest.test_grad_layout_1devicemodule_2replicaperprocess`).
- https://github.com/pytorch/pytorch/pull/40290 tries the test with purely rowmajor contiguous params on an untouched master.  In other words https://github.com/pytorch/pytorch/pull/40290 contains none of this PR's diffs aside from the test itself.
- https://github.com/pytorch/pytorch/pull/40178, for comparison, tries the test with this PR's diffs.

Both fail the same way, indicating failure is unrelated to this PR's other diffs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40358

Differential Revision: D22165785

Pulled By: albanD

fbshipit-source-id: ac7cdd79af5c080ab74341671392dca8e717554e
2020-06-22 17:13:21 -07:00
9e5d62582c [android][gradle] packaging headers in aars for publishing (#40392)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40392

Test Plan: Imported from OSS

Differential Revision: D22167757

Pulled By: IvanKobzarev

fbshipit-source-id: 363319c64933382c0b0ddce65624fe5a4602da26
2020-06-22 16:56:39 -07:00
ae2f1f0372 [DDP Note] Remove refs to RoundRobin PG until we officially support it (#40380)
Summary:
Removes line mentioning `ProcessGroupRoundRobin` since we don't intend it to be used as a public API just yet. We can add this back when we officially support the API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40380

Differential Revision: D22165556

Pulled By: rohan-varma

fbshipit-source-id: 24d0477d881dc74f2ff579de61dfd1ced2b09e75
2020-06-22 16:19:29 -07:00
016cf7d66e Revert D22102408: DNNL: enable conv3d
Test Plan: revert-hammer

Differential Revision:
D22102408

Original commit changeset: 1e95cede429f

fbshipit-source-id: a20b725164177e8571320007548a58cc4779d669
2020-06-22 15:41:51 -07:00
17fe0e2b8a Revert D22102407: DNNL: enable batchnorm3d
Test Plan: revert-hammer

Differential Revision:
D22102407

Original commit changeset: c9dbb61d0538

fbshipit-source-id: d40976aa8120d2d0839624bf02c082d7d1eb610d
2020-06-22 15:39:37 -07:00
0d0608532c [JIT] Fork/Join inline docs (#39952)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39952

Differential Revision: D22165477

Pulled By: eellison

fbshipit-source-id: 93132cd6987fdd2484112a57ef17912b8fcc5fab
2020-06-22 15:34:05 -07:00
13a8ec3cc5 Revert D22102406: DNNL: enable max_pool3d and avg_pool3d
Test Plan: revert-hammer

Differential Revision:
D22102406

Original commit changeset: 296a87188b79

fbshipit-source-id: ff023be5e8dd4bfcd68770cab305da6ba2e03893
2020-06-22 15:23:01 -07:00
9498e24ca8 Revert D22138737: DNNL: enable dilation conv
Test Plan: revert-hammer

Differential Revision:
D22138737

Original commit changeset: 4225bc7d2624

fbshipit-source-id: 7bbafbe9f412a8f9167e3ae4425dbc933ec67c6b
2020-06-22 15:20:55 -07:00
8ec2ae9a9f Add view_as_real, view_as_complex for complex tensors (#39099)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39099

Test Plan: Imported from OSS

Differential Revision: D22057886

Pulled By: anjali411

fbshipit-source-id: bad5ba7097ba0dd13f2c549b2463094dee9afa14
2020-06-22 15:15:27 -07:00
7a3c223bbb Migrate var & std to ATen (#39967)
Summary:
Not sure why there are so many issues for std & var, but this PR should close them all:
std: Fix https://github.com/pytorch/pytorch/issues/24771, Fix https://github.com/pytorch/pytorch/issues/24676, Fix https://github.com/pytorch/pytorch/issues/24639, Fix https://github.com/pytorch/pytorch/issues/24529
var: Fix https://github.com/pytorch/pytorch/issues/24782, Fix https://github.com/pytorch/pytorch/issues/24677, Fix https://github.com/pytorch/pytorch/issues/24652, Fix https://github.com/pytorch/pytorch/issues/24530

```py
import time
import torch

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

for device in (torch.device("cpu"), torch.device("cuda")):
    for size in (
        [100000000],
        [10000, 10000],
        [1000, 1000, 100],
        [100, 100, 100, 100],
    ):
        t = torch.randn(*size, device=device)
        total_time = 0
        for i in range(10):
            t1 = _time()
            t.std()
            t2 = _time()
            total_time += t2 - t1
        print(f"Tensor of size {size} on {device}: {total_time / 10}")
```

Before:
```
Tensor of size [100000000] on cpu: 0.36041643619537356
Tensor of size [10000, 10000] on cpu: 0.37235140800476074
Tensor of size [1000, 1000, 100] on cpu: 0.386572527885437
Tensor of size [100, 100, 100, 100] on cpu: 0.37404844760894773
Tensor of size [100000000] on cuda: 0.0021645784378051757
Tensor of size [10000, 10000] on cuda: 0.002090191841125488
Tensor of size [1000, 1000, 100] on cuda: 0.00208127498626709
Tensor of size [100, 100, 100, 100] on cuda: 0.0020844221115112306
```

After:
```
Tensor of size [100000000] on cpu: 0.1339871883392334
Tensor of size [10000, 10000] on cpu: 0.1343991994857788
Tensor of size [1000, 1000, 100] on cpu: 0.1346735954284668
Tensor of size [100, 100, 100, 100] on cpu: 0.11906447410583496
Tensor of size [100000000] on cuda: 0.0013531208038330077
Tensor of size [10000, 10000] on cuda: 0.0012922048568725585
Tensor of size [1000, 1000, 100] on cuda: 0.001285886764526367
Tensor of size [100, 100, 100, 100] on cuda: 0.0012899160385131836
```

cc: VitalyFedyunin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39967

Differential Revision: D22162469

Pulled By: VitalyFedyunin

fbshipit-source-id: 8d901c779767b00f81cd6231bc665e04f297b4c3
2020-06-22 14:25:18 -07:00
c4fc278fa8 Build docker for CUDA11 (#40231)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40231

Differential Revision: D22168068

Pulled By: malfet

fbshipit-source-id: 4706e7a113c2006acbb1a63ff8a657975aa5369b
2020-06-22 13:55:15 -07:00
02ae9a1583 add TypeError to c10 and fix segfault in error checking in Tensor constructor (#40106)
Summary:
As per title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40106

Differential Revision: D22137193

Pulled By: albanD

fbshipit-source-id: 11d059263c00a834211f016bd9a9e18fdc0437ef
2020-06-22 13:42:44 -07:00
a8ab78c815 Added a link to Contribution guide in Readme (#40353)
Summary:
Added a link to `CONTRIBUTION.md` in `README.md` for easy reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40353

Differential Revision: D22167138

Pulled By: ezyang

fbshipit-source-id: fe7b7f190c8135fdd2e71696c1cf8d84bcd40fc6
2020-06-22 13:20:06 -07:00
dbcc5b7533 DNNL: enable dilation conv (#40220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40220

Test Plan: Imported from OSS

Differential Revision: D22138737

Pulled By: VitalyFedyunin

fbshipit-source-id: 4225bc7d26241b443d18ef9d56326e5a9e6bbeda
2020-06-22 13:14:09 -07:00
43331609a4 Port addmm, addbmm, addr to ATen (CUDA) (#38421)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24536, fixes https://github.com/pytorch/pytorch/issues/24534 and fixes https://github.com/pytorch/pytorch/issues/24533
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38421

Differential Revision: D22138333

Pulled By: VitalyFedyunin

fbshipit-source-id: f4411d0df0a001bbb95089eb55fdcac3aba86700
2020-06-22 13:02:33 -07:00
c873895722 DNNL: enable max_pool3d and avg_pool3d (#35664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35664

Test Plan: Imported from OSS

Differential Revision: D22102406

Pulled By: VitalyFedyunin

fbshipit-source-id: 296a87188b79545741f6b7e136a58e4380564f25
2020-06-22 11:57:12 -07:00
8df35fd755 DNNL: enable batchnorm3d (#35663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35663

Test Plan: Imported from OSS

Differential Revision: D22102407

Pulled By: VitalyFedyunin

fbshipit-source-id: c9dbb61d0538ab9e1e76b2815564030b5f89d33e
2020-06-22 11:57:09 -07:00
6ba807cb43 DNNL: enable conv3d (#35662)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35662

Test Plan: Imported from OSS

Differential Revision: D22102408

Pulled By: VitalyFedyunin

fbshipit-source-id: 1e95cede429f1a950f26bc7052ab33d198857df3
2020-06-22 11:55:04 -07:00
03af4dcbbf Utilise the vector version for sinh and cosh (UnaryOpsKernel) (#36396)
Summary:
Utilise the existing methods of `Vec256` class.

Not sure if there should be tests and if yes where.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36396

Differential Revision: D22155803

Pulled By: VitalyFedyunin

fbshipit-source-id: 500dcb5c79650bc5daa0c9683d65eeab6f9dd1d3
2020-06-22 11:38:37 -07:00
87c5f02f3d jit: Conv3d + BatchNorm3d fusion (#40082)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40082

Differential Revision: D22120340

Pulled By: jerryzh168

fbshipit-source-id: fce6c5f03fe7ab6c60620cbdf547d5a466a470e3
2020-06-22 11:15:52 -07:00
14f7e95c1a Add prefix of remote events for RPC profiling (#40066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40066

Builds on top of the previous PR to ensure that all remotely profiled events are prefixed with the key for the RPC that generated them.

The key is generated by the result of `_build_rpc_profiling_key` in `rpc/internal.py` and prefixed onto the event name. In order to do this, we set the current-key when creating the RPC in Python, retrieve the currently-set key in C++ and save a GloballyUniqueId -> key mapping to an in-memory map. When we receive an RPC with profiling information, we expect to receive this ID back, and look up the corresponding profiling key in the map.

The key is then added to all the remote events.

Tested by adding tests to ensure the key is added to all the remote events. Also added a UT which tests in under the multi-threading scenario, to ensure that the mapping's correctness is maintained when several RPCs are in the process of being created at once.
ghstack-source-id: 106316106

Test Plan: Unit test

Differential Revision: D22040035

fbshipit-source-id: 9215feb06084b294edbfa6e03385e13c1d730c43
2020-06-22 11:01:07 -07:00
17d3f74ea3 Relax cudnn conditions for channels-last convolutions (#38904)
Summary:
Follow up of https://github.com/pytorch/pytorch/issues/38044. Thanks ptrblck, mcarilli for the help on discussing the changes!

Could fix https://github.com/pytorch/pytorch/issues/37725 by skipping the depthwise-workload check introduced in https://github.com/pytorch/pytorch/issues/22302. This PR also relaxed dilated convolution for channels-last.

The testing script is https://gist.github.com/xwang233/82a707f69bb710cb612349280a2c5f41. About 387k conv arguments were tested and no cudnn exception was thrown.

cc ngimel VitalyFedyunin ptrblck mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38904

Differential Revision: D22155797

Pulled By: VitalyFedyunin

fbshipit-source-id: 81b5736cec67ea263029121521c6acafd9dddba6
2020-06-22 10:59:37 -07:00
c72ab19458 Add addmv for complex dtypes (#40238)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40238

Differential Revision: D22160528

Pulled By: anjali411

fbshipit-source-id: 04093e5929318a7acc9c9b502c76d0a8bf15d5e1
2020-06-22 10:54:35 -07:00
3894de569e Reenable memory format test for some unary functions (#39102)
Summary:
Many of them have already been migrated to ATen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39102

Differential Revision: D22162193

Pulled By: VitalyFedyunin

fbshipit-source-id: 80db9914fbd792cd610c4e8ab643ab97845fac9f
2020-06-22 10:46:28 -07:00
9f9e7c1d71 [quant][refactor] Tests for torch.jit.quantized (#40330)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40330

Test Plan: Imported from OSS

Differential Revision: D22149707

fbshipit-source-id: 44e7545bf9277d9245b5e9c2d9461f664fff0426
2020-06-22 10:41:31 -07:00
eaa91071ca [ONNX] Support large attribute and subgraph for large model (#38793)
Summary:
Previously large tensor data in attributes and subgraphs are not stored externally. ONNX won't be able to serialize the model for cases where the total size sums up to >= 2GB. This PR enables that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38793

Reviewed By: hl475

Differential Revision: D22111092

Pulled By: houseroad

fbshipit-source-id: 355234e50825d576754de33c86a9690161caaeaf
2020-06-22 10:34:37 -07:00
49887d1fc0 reference Swish implementation (#40150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40150

added a skeleton for a Swish implementation using fakelowp
this implementation is as precise as it gets since it uses computation in fp32 as a reference

-simplified the test since this is a linear sweep, no need to randomize it
-modified the domain to ensure that 0 is always covered

Test Plan: ran this test against the lowered swish implementation and found that the interpolation domain should be [-21,12] to cover even the smallest value in the Y domain

Reviewed By: venkatacrc

Differential Revision: D22025105

fbshipit-source-id: dd8561243182c359003b4370ce2312f607d964c9
2020-06-22 10:28:00 -07:00
3fa0b1e325 ONNX: fix bug in export of cumsum operator (#40044)
Summary:
The "cast" operator is currently added after the cumsum operator, but it should be added before, since torch.cumsum supports more types than ONNX (specifically, bool).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40044

Reviewed By: hl475

Differential Revision: D22158013

Pulled By: houseroad

fbshipit-source-id: e6c706572b9b8de880d4d71eaa132744ef01ad4d
2020-06-22 10:11:35 -07:00
c04d39aaf2 [quant][bug] Histogram observer bug fix with min == max (#40310)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40310

Test Plan:
python test/test_quantization.py test_histogram_observer_same_inputs

Imported from OSS

Differential Revision: D22145908

fbshipit-source-id: c1646d9ae6738755981fe3d09c8a8e25fcb994d4
2020-06-22 10:05:10 -07:00
766889b6bf ONNX: fix bug in export of ops involving torch.bool type (#40006)
Summary:
When an op involves creating a tensor of a certain type (such as torch.ones(...)), the tracer creates a `prim::Constant` node with an integer value representing the type. The mapping from the torch type to integers maps:
```
torch.complex32 -> 8
torch.complex64 -> 9
torch.complex128 -> 10
torch.bool -> 11
```
However, when the ONNX exporter maps back the integer to torch type, 10 is mapped to bool, 9 is mapped to complex128 and 8 is mapped to complex64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40006

Reviewed By: hl475

Differential Revision: D22158019

Pulled By: houseroad

fbshipit-source-id: 42fbd6b56566017ff03382c4faf10d30ffde3802
2020-06-22 09:57:25 -07:00
0e146d2df4 Update TensorPipe submodule (#40374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40374

To pick up two fixes to MPT:
4b1b855f21
462200aad3

MPT isn't yet used by PyTorch so this should have no effect

Test Plan: Export to CircleCI and test

Reviewed By: patricklabatut

Differential Revision: D22160029

fbshipit-source-id: 202ea7487fcde015e5856f71ad6aebdfa6564ee1
2020-06-22 09:40:17 -07:00
e4766fb4d9 Meta tensors, but without code deduplication (#38490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38490

A meta tensor is a tensor that is a lot like a normal tensor,
except it doesn't actually have any data associated with it.
You can use them to carry out shape/dtype computations without
actually having to run the actual code; for example, this could
be used to do shape inference in a JIT analysis pass.
Check out the description in DispatchKey.h for more information.

Meta tensors are part of a larger project to rationalize how we
write kernels so that we don't have to duplicate shape logic
in CPU kernel, CUDA kernel and meta kernel (this PR makes the
duplication problem worse!)  However, that infrastructure can
be built on top of this proof of concept, which just shows how
you can start writing meta kernels today even without this
infrastructure.

There are a lot of things that don't work:
- I special cased printing for dense tensors only; if you try to
  allocate a meta sparse / quantized tensor things aren't going
  to work.
- The printing formula implies that torch.tensor() can take an
  ellipsis, but I didn't add this.
- I wrote an example formula for binary operators, but it isn't
  even right!  (It doesn't do type promotion of memory layout
  correctly).  The most future proof way to do it right is to
  factor out the relevant computation out of TensorIterator,
  as it is quite involved.
- Nothing besides torch.add works right now
- Meta functions are ALWAYS included in mobile builds (selective
  build doesn't work on them).  This isn't a big deal for now
  but will become more pressing as more meta functions are added.

One reason I'm putting up this PR now is to check with Yinghai Lu
if we can unblock shape inference for accelerators, while we are
still working on a long term plan for how to unify all shape
computation across our kernels.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21935609

Pulled By: ezyang

fbshipit-source-id: f7d8636eeb8516b6bc296db99a16e56029972eee
2020-06-22 09:18:33 -07:00
52f3a09663 ROCm: Use correct device type when exporting tensors to DLPack (#40124)
Summary:
Before this PR, DLPack export was tricked by the CUDA masquerading of the HIP backend into thinking that it was exporting a CUDA tensor. We change that to use the ROCM device type instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40124

Differential Revision: D22145215

Pulled By: ezyang

fbshipit-source-id: 276f709861c55f499ae753d0bba48ddcc8b85926
2020-06-22 08:59:43 -07:00
db5b273961 Rename dont_resize_outputs() to resize_outputs(false) (TensorIterator… (#40148)
Summary:
…) https://github.com/pytorch/pytorch/issues/40119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40148

Differential Revision: D22144022

Pulled By: ezyang

fbshipit-source-id: abd697ffd88b927877875e4c431ee39bd21eba24
2020-06-22 08:55:02 -07:00
396087bfd8 [ROCm] Enable BFloat16 for pow, exp, erf ops on ROCm (#40236)
Summary:
Enable ops used in BERT which were missed in one of my earlier PRs.
ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40236

Differential Revision: D22143965

Pulled By: ezyang

fbshipit-source-id: 5464ed021687fec1485e1c061e5a7aba71687fc4
2020-06-22 08:22:17 -07:00
881c1adfcd Fixed buffer update in BatchNorm when track_running_stats is set to False (#38084)
Summary:
This PR aims at tackling https://github.com/pytorch/pytorch/issues/37823 by:
- ensuring that buffers will be used for normalization computation but won't be updated, when buffers are not None, and `track_running_stats=False`
- adding a corresponding unittest to ensure expected behaviour

Any feedback is welcome!

_Note: we might want to update the docstrings of  `BatchNorm*d`, feel free to share any suggestion!_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38084

Differential Revision: D22047871

Pulled By: ezyang

fbshipit-source-id: 5acbcad9773e7901f26d625db71d43d7dc236d3e
2020-06-22 08:17:31 -07:00
eb92ed6239 Append forward slashes to PIP_UPLOAD_FOLDER (#40352)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40352

Differential Revision: D22160395

Pulled By: ezyang

fbshipit-source-id: bb803c8a7cf7f8fd7682095b8b1917ec22a15495
2020-06-22 07:08:33 -07:00
37c88a4731 Pin the version of scipy for Windows test jobs (#40369)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40366.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40369

Differential Revision: D22160114

Pulled By: malfet

fbshipit-source-id: ea4c1349fc83787853d4925e7d0a2a63aecf0d77
2020-06-22 06:34:15 -07:00
ab8a99bd36 graph mode: add hardswish inplace handling (#40284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40284

Adds graph mode handling for inplace hardswish, and test coverage for functional hardswish.

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_hardswish
```

Imported from OSS

Differential Revision: D22140628

fbshipit-source-id: 55a514f7dc1130d510f69ee4e611d7cb5e08d02e
2020-06-21 09:40:50 -07:00
c6dbfcaf9e quantized elu: graph mode handling (#40111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40111

Adds graph mode handling for quantized elu.

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_elu
```

Imported from OSS

Differential Revision: D22075080

fbshipit-source-id: 37fb1b9e390f2a33d47cbd025157532379b6aa64
2020-06-21 09:40:48 -07:00
cd0afe2b8e quantized elu: eager mode QAT handling (#40104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40104

Adds eager mode QAT handling for quantized ELU.

Test Plan:
```
python test/test_quantization.py TestQuantizationAwareTraining.test_activations
```

Imported from OSS

Differential Revision: D22075082

fbshipit-source-id: 90eb06e4c52ec542fda97d7ee108a38465d3e845
2020-06-21 09:40:46 -07:00
03ed802a90 quantized elu: eager mode static handling (#40103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40103

Add eager mode static quantization handling for quantized ELU.

Test Plan:
```
python test/test_quantization.py TestStaticQuantizedModule.test_elu
python test/test_quantization.py TestPostTrainingStatic.test_activations
```

Imported from OSS

Differential Revision: D22075081

fbshipit-source-id: 8a3df428be135a0565472ebd0f55fa801689bcc5
2020-06-21 09:40:44 -07:00
13d54c6471 quantized elu: require observation (#40100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40100

ELU has a range of [-1, inf]. In the original PR which added
the quantized operator we decided to pass the quantization params
from the input.  However, it makes more sense to require observation
for this op.

This PR changes the API to require observation. Next PRs in this stack
will add the eager and graph mode handling.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qelu
```

Imported from OSS

Differential Revision: D22075083

fbshipit-source-id: 0ea0fd05a00cc7a5f122a2b1de09144bbd586f32
2020-06-21 09:38:28 -07:00
3bbedb34b9 restore generic IndexToScatterGatherOffset specialization (#40349)
Summary:
https://github.com/pytorch/pytorch/issues/39963 erroneously removed template specialization to compute offsets, causing cases relying on this specialization (topk for 4d+ tensors with topk dimension >= 1024/2048 depending on the type) to produce bogus results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40349

Differential Revision: D22153756

Pulled By: ngimel

fbshipit-source-id: cac04969acb6d7733a7da2c1784df7d30fda1606
2020-06-20 23:14:13 -07:00
e632bf8d57 Add thrift and tensorpipe backend tests for test_ddp_under_dist_autograd. (#40210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40210

ghstack-source-id: 106300839

Test Plan: waitforbuildbot

Differential Revision: D22110065

fbshipit-source-id: d9ebd009b8d451c75708eadc7eb3f2b788e875aa
2020-06-20 22:59:59 -07:00
ac8c3c0ad1 Fix update_s3_html for nightly jobs (#40338)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40337.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40338

Differential Revision: D22152913

Pulled By: seemethere

fbshipit-source-id: bb76c726820efc7d0127201c4bd072bba95783c5
2020-06-20 15:08:06 -07:00
3852215170 [vulkan] jit passes for vulkan conv2 prepack and fuse with clamp (#39282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39282

Test Plan: Imported from OSS

Differential Revision: D21962424

Pulled By: IvanKobzarev

fbshipit-source-id: 2d20e827d2c3836b7e6b443293377c68dc1ffa5a
2020-06-20 14:12:21 -07:00
f69460d0cb Add unit test to verify DDP + RPC correctness. (#40139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40139

This unit test runs the same set of operations locally and then with
DDP + RPC to verify correctness.
ghstack-source-id: 106287490

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/:ddp_under_dist_autograd

I ran these to make sure I am workin on a clean git repo.

git submodule update --init --recursive

to get latest tensor pipe code, otherwise build will have error.

to record installed binaries and torch package wheels to system paths

with-proxy env BUILD_CAFFE2_OPS=0 USE_CUDA=0 USE_MKLDNN=0 USE_DISTRIBUTED=1 python setup.py install --record files.txt

remove binaries and torch package wheels from system paths.

xargs rm -rf < files.txt

build in develop mode

with-proxy env BUILD_CAFFE2_OPS=0 USE_CUDA=0 USE_MKLDNN=0 USE_DISTRIBUTED=1 python setup.py develop

pytest test/distributed/test_ddp_under_dist_autograd.py::TestDdpUnderDistAutograd -v

Differential Revision: D22084385

fbshipit-source-id: e1f57e86ceddd4c96920ed904898e1763b47e8f2
2020-06-20 13:13:32 -07:00
a47fb57957 Change memory format promotion rules of point wise operators. (#37968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37968

Modify memory format promotion rules to avoid promoting when one of the input is ambiguous. New rules are:
 Ambiguous + Contiguous = Contiguous
 Ambiguous + Channels Last = Channels Last
 Contiguous + Ambiguous ( NC11 ) = Contiguous
 Contiguous + Channels Last = Contiguous ( + Warning )  Before this PR: Channels Last
 Channels Last + Contiguous = Channels Last ( + Warning )
 Channels Last + Ambiguous = Channels Last
 Bias + Channels Last = Channels Last
 Channels Last + Bias = Channels Last

Test Plan: Imported from OSS

Differential Revision: D21819573

Pulled By: VitalyFedyunin

fbshipit-source-id: 7381aad11720b2419fb37a6da6ff4f54009c6532
2020-06-20 10:33:32 -07:00
c1dfc05cc9 [android][test_app][reland] test_app example linking to pytorch_android aar content (#40313)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40313

Test Plan: Imported from OSS

Differential Revision: D22147079

Pulled By: IvanKobzarev

fbshipit-source-id: c70a0a9dda8834376ed304a461318d4c6ef84582
2020-06-20 07:34:42 -07:00
4cbf87dc92 [PyTorch Numeric Suite] Add support for dynamic LSTM (#40065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40065

Add support for dynamic LSTM of all three Numeric Suite APIs: compare_weights(), compare_model_stub() and compare_model_outputs().
ghstack-source-id: 106291782

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'

buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22058275

fbshipit-source-id: 76cb42ce16b6b02b0b90f7582252756582660921
2020-06-20 07:00:13 -07:00
0079e429d6 Remove incorrect warning message on rounding mode (#40301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40301

ghstack-source-id: 106258861

Test Plan: Fix warning message

Differential Revision: D22143261

fbshipit-source-id: 73a3b09ea82eb470c6702a413d1f984bbf38b3ea
2020-06-20 02:09:44 -07:00
9da277c635 [quant][graphmodel] linear_relu (#40021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40021

This replaces #36889 due to significant merge conflicts

Test Plan: Imported from OSS

Differential Revision: D22087061

Pulled By: z-a-f

fbshipit-source-id: 6a65cdd3c0c0c957968a9d017902fb6d03b58150
2020-06-19 23:32:54 -07:00
e04a611b91 [quant][graphmode] clang format changes (#40329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40329

Test Plan: Imported from OSS

Differential Revision: D22149706

fbshipit-source-id: 3c07cb0c09a53a01fc69185943ddc409264a6ff5
2020-06-19 23:22:43 -07:00
59ca1d31ca [quant][graphmode] docstrings for top level APIs (#40328)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40328

Test Plan: Imported from OSS

Differential Revision: D22149708

fbshipit-source-id: 63a1cd229d9e4668fba0ef3977e894cb8984318b
2020-06-19 22:20:23 -07:00
7a837019a4 [caffe2] optimize 2/4-bit row-wise quantization (#387)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39985

avx2 optimized 2/4-bit row-wise quantization/dequantization in perfkernels.
This diff slightly change the numerics of quantization by multiplying with the inverse of scale instead of dividing with scale.

Test Plan:
In my devserver

for i in 2 4 8; do echo $i; buck run mode/opt :fused_rowwise_nbit_conversion_bench -- --bit-rate=$i; done

Before this diff
2-bit
        3.35394 ms.        100%. FloatToFused2BitRowwiseQuantized
4-bit
        3.60351 ms.        100%. FloatToFused4BitRowwiseQuantized
8-bit
       0.434467 ms.        100%. FloatToFused8BitRowwiseQuantized

After this diff

2-bit
       0.606386 ms.        100%. FloatToFused2BitRowwiseQuantized
4-bit
       0.446683 ms.        100%. FloatToFused4BitRowwiseQuantized
8-bit
         0.4349 ms.        100%. FloatToFused8BitRowwiseQuantized

Reviewed By: choudharydhruv, jianyuh

Differential Revision: D22033195

fbshipit-source-id: d3a219e47b8345268d90a160c9314ed0d5b71467
2020-06-19 21:28:31 -07:00
cfe1c6ef9e Update XLAPreAutograd keys. (#40265)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40265

Differential Revision: D22137998

Pulled By: ailzhang

fbshipit-source-id: 41edac06f8aafa5d4c1dcefd5da81be6c9ac4a9c
2020-06-19 21:12:50 -07:00
5c133eb2db fix small typo in optim adamw (#40283)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40283

Test Plan: Imported from OSS

Differential Revision: D22138796

Pulled By: glaringlee

fbshipit-source-id: 2c3a35f7e539b43ee5abf8dbc10b95df5d62fccb
2020-06-19 19:10:17 -07:00
4b028a8e07 [jit] support pad_sequence/pack_sequence (#39844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39844

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22026720

Pulled By: wanchaol

fbshipit-source-id: cc51ea77eff3689e319ec7e89a54c788646b5940
2020-06-19 19:03:14 -07:00
4f761f325c Back out "[pytorch][PR] Removes dunder div"
Summary: NVIDIA's Apex is updating to no longer rely on this behavior, but we're reverting this Python2->Python3 update to unblock internal apex users.

Test Plan: Sandcaslte + OSS CI.

Reviewed By: ngimel

Differential Revision: D22146782

fbshipit-source-id: f9483d2cbf9dc3a469ad48a6c863edea3ae51070
2020-06-19 18:31:20 -07:00
5555d210b1 Cleanup TensorIteratorDynamicCasting.h (#39839)
Summary:
std::complex, and thrust::complex has gone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39839

Differential Revision: D22139528

Pulled By: ngimel

fbshipit-source-id: 535e8c137212338569c83c46ed6fd829934e4043
2020-06-19 18:17:50 -07:00
b2f489dc57 [quant][graphmode] Rename graph mode quantization API to quantize_jit (#40212)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40212

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D22144745

fbshipit-source-id: 38a19b5afdddbbce262eea8ddf5b68458e6017b3
2020-06-19 18:13:37 -07:00
6d70d1574f rename the LayerNorm operator and add it to the replacement map (#40318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40318

rename layernom fakefp16 to the right naming convention
add it to the map of replacement ops

this can be done even if the operator is not complete because we are blacklisting anyways

Test Plan: net_runner and inspected the log that replacement happened

Reviewed By: venkatacrc

Differential Revision: D22145900

fbshipit-source-id: f19794ec05234b877f7697ed8b05dd8f46606c47
2020-06-19 16:49:22 -07:00
fb17b05f33 Make dynamic casting case also benefit from unrolling (#34749)
Summary:
This is based on https://github.com/pytorch/pytorch/issues/34708, I didn't use stacked diff because is not very convenient for cherry-picking. Please review after https://github.com/pytorch/pytorch/issues/34708 merged.

**Legacy kernels are now completely gone. And the rewrite of GPU loops is done.**

Benchmark shows big improvements in performance on RTX 2080ti:
https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-unroll-with-dyn-casting.ipynb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34749

Differential Revision: D22139597

Pulled By: ngimel

fbshipit-source-id: 5995744c339afee331f15ea2e483c6acf3ce0c62
2020-06-19 16:43:46 -07:00
4194456158 Add _enable_record_function python API (#40306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40306

Adding _enable_record_function

Test Plan: CI

Differential Revision: D22143026

fbshipit-source-id: dc466ad3303cb1d52a66aab74ba668e36bab5458
2020-06-19 16:08:00 -07:00
a80dd02a22 [Resubmit] Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40249

Blocking wait didn't work for dist.barrier() since we performed a
cudaDeviceSynchronize() before we performed any of the timeout checks. As a
result, in case of failures/desync the barrier() call would get stuck on
cudaDeviceSynchrnonize() and would never return a timeout error to the user.

To fix this, I've moved the device synchronization after the timeout checks.
ghstack-source-id: 106250153
ghstack-source-id: 106250153

Test Plan: waitforbuildbot

Differential Revision: D22126152

fbshipit-source-id: d919a7a6507cca7111d8ad72e916777b986d0d67
2020-06-19 15:42:43 -07:00
314d645e05 Add a warning to mention that async_execution does not work with autograd profiler (#40309)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40309

Test Plan: Imported from OSS

Differential Revision: D22145130

Pulled By: mrshenli

fbshipit-source-id: d6f7250e53648d6939367f1ad4c9b898be00afed
2020-06-19 15:35:00 -07:00
5d0044389a Minor RPC doc improvements (#40305)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40305

Test Plan: Imported from OSS

Differential Revision: D22144304

Pulled By: mrshenli

fbshipit-source-id: 1c8a9648043eabaf909c6e4ae116672396a9f0f5
2020-06-19 15:34:58 -07:00
a9f0156271 Fix RRef to_here() docs (#40300)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40300

Test Plan: Imported from OSS

Differential Revision: D22143252

Pulled By: mrshenli

fbshipit-source-id: 85a5b7a7bab9ad29fe71064c927b059dd1ab39f9
2020-06-19 15:34:56 -07:00
caf0c286b8 Fix RPC API doc links (#40299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40299

Test Plan: Imported from OSS

Differential Revision: D22143156

Pulled By: mrshenli

fbshipit-source-id: c11848ebfe8863d59509a0fbc042eed71a58e514
2020-06-19 15:34:53 -07:00
d6d579397d Improve docs for init_rpc (#40298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40298

Test Plan: Imported from OSS

Differential Revision: D22143155

Pulled By: mrshenli

fbshipit-source-id: deadcc29eda157b401ca6a091c3ba17455acb6b5
2020-06-19 15:34:51 -07:00
3ca05500fa Improve RPC documents (#40296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40296

1. Added a link to parameter server tutorial
2. Explained current states for TorchScript support

Test Plan: Imported from OSS

Differential Revision: D22142647

Pulled By: mrshenli

fbshipit-source-id: ffd697dd64a3aa874cf3f3488122ed805903370d
2020-06-19 15:34:49 -07:00
4463f59c2c Let torch.futures.wait_all re-throw errors (#40291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40291

Test Plan: Imported from OSS

Differential Revision: D22141702

Pulled By: mrshenli

fbshipit-source-id: 50b5e5c687e87930aef3a50cc40839729a4eb9c6
2020-06-19 15:32:56 -07:00
f92089b8ca [pytorch] tweak code analyzer script to handle new namespaces (#40276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40276

- add a couple new namespaces;
- handle the case where both contextual namespace and opreator namespace
  are set (BackendSelectRegister.cpp and #39401);
- improve error message;

Test Plan: Imported from OSS

Differential Revision: D22135686

Pulled By: ljk53

fbshipit-source-id: 14d359c93573349b8fe1e05d7e44d875295a5f6d
2020-06-19 14:54:21 -07:00
6df97c20c2 Make test case precision property (#40057)
Summary:
Make `common_utils.TestCase.precision` a property, because it is overriden as such in `common_device_type`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40057

Differential Revision: D22138385

Pulled By: malfet

fbshipit-source-id: 0e7c14654bf60f18f585efc61f96fdd0af23346f
2020-06-19 14:24:55 -07:00
c73095e78f Add note to serialization docs about zipfile format (#40288)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40288

Test Plan: Imported from OSS

Differential Revision: D22140324

Pulled By: jamesr66a

fbshipit-source-id: 01d7aa642ed2f4e4bdac4b7f3223bf4d7e62fd4d
2020-06-19 13:40:08 -07:00
73a156e81f [ONNX] Update pytorch/onnx docs for new export API args (#39802)
Summary:
Update pytorch/onnx docs for new export API args:
Use external data format and Training args.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39802

Reviewed By: hl475

Differential Revision: D22139664

Pulled By: houseroad

fbshipit-source-id: 7d6dcf75129cb88987f8c37b7d9d48ca594c0f38
2020-06-19 13:38:47 -07:00
41865d8f19 [ONNX] Update black_listed_operators for opset 12 (#39414)
Summary:
Remove black_listed_operators for opset 12 as we now support these ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39414

Reviewed By: hl475

Differential Revision: D21915584

Pulled By: houseroad

fbshipit-source-id: 37ec7bdd2b5a845484535054026d6613d0921b7a
2020-06-19 13:33:25 -07:00
65f67bbe92 improvements to sls 4bit
Summary: enhance the sls test to reflect the shapes and values

Test Plan: ran sls tests on device and emulator

Reviewed By: amylittleyang

Differential Revision: D22094433

fbshipit-source-id: 610a79433ae6c58f626b5984a3d89d9e1bbf4668
2020-06-19 13:30:53 -07:00
c3ce35e67b Update TensorPipe submodule
Summary:
This is to import a few features:
- a fix to a race condition happening in SHM's use of epoll
- a new XTH channel, that uses a memcpy to transfer between threads of the same process
- a new MPT channel, that chunks and multiplexes tensors over multiple transport event loops

Test Plan: Run in CircleCI

Reviewed By: patricklabatut

Differential Revision: D22140736

fbshipit-source-id: a3cee8a3839d98a42b8438844a9fd24fd85b2744
2020-06-19 13:22:06 -07:00
b48742322a move ROCm 3.5 thunk upgrade from build.sh into test.sh (#40286)
Summary:
https://github.com/pytorch/pytorch/issues/40181 incorrectly placed the thunk work-around into the build.sh scripts.  It needed to be in test.sh.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40286

Differential Revision: D22140366

Pulled By: xw285cornell

fbshipit-source-id: 2a3d73594d1963c8c80cd8c45d06f1c963b9cbee
2020-06-19 12:30:30 -07:00
ca0540a7eb Remove variable shadowing from tensorpipe lambda (#39126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39126
futureResponseMessage is shadowed in the pipeWrite lambda which
creates some confusion, since it is used in the initial error handling but then
a future of the same name is created when marking the future as completed. This
change removes this by getting rid of the futureResponseMessage capture,
instead capturing the message id. This change also makes it so that we don't
need to copy it into the lambda.
ghstack-source-id: 106211353

Test Plan: CI

Differential Revision: D22127398

fbshipit-source-id: c98a53b5630ce487461e4ca9cd72fbd34788298d
2020-06-19 12:25:42 -07:00
cdbf78fba0 Revert D22118945: [android] test_app example linking to pytorch_android aar content
Test Plan: revert-hammer

Differential Revision:
D22118945 (52a2adb3f4)

Original commit changeset: 31c54b49b1f2

fbshipit-source-id: 0c4929d4441572debbbc49f8674b9fc49b726599
2020-06-19 12:16:18 -07:00
465138ec39 refactoring TestQuantizeScript (#39677)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39677

Test Plan:
Moved a test class suite between files, wanted to have same functionality (simple code refactor) so tested to make sure the test output was the same before/after the refactor.
Image below shows the output of TestGraphModePostTrainingStatic before refactor

{F239676498}

This image shows the output of TestQuantizeScript (renamed version that is in test_quantize_script.py instead of test_quantize.py)

{F239676509}

Differential Revision: D21940638

Pulled By: edmundw314

fbshipit-source-id: 54160a5151aadf3a34bdac2bcaeb52904e6653ed
2020-06-19 11:47:00 -07:00
3684dfafc2 Fix typos in RPC examples (#40280)
Summary:
There has a missing '=' in rpc_sync call in RPC example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40280

Differential Revision: D22137619

Pulled By: mrshenli

fbshipit-source-id: f4e4b85f68fd68d29834e199416176454b6bbcc2
2020-06-19 11:43:11 -07:00
b670ff2d3a Add typing for _CudaStreamBase and _CudaEventBase classes (#40256)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40256

Differential Revision: D22139369

Pulled By: malfet

fbshipit-source-id: c7f4f8709700eb85d971ad504dd3552e311cb58d
2020-06-19 11:29:41 -07:00
52e4e3a9b8 NCCL Comment typo fix (#40242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40242

Comment Typo in ProcessGroupNCCL
ghstack-source-id: 106088379

Test Plan: CI

Differential Revision: D22099219

fbshipit-source-id: ddce91e640d4eea54e0698166c6276aeffedeb1e
2020-06-19 11:24:52 -07:00
d9c804ce22 [PyTorch Numeric Suite] Add support for dynamic quantization of linear module (#39024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39024

Add support for dynamic quantization of linear module.
ghstack-source-id: 106205450

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'

buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D21675971

fbshipit-source-id: c9562744dc59b61cf47f2787a934e6a5a53e12fd
2020-06-19 10:58:56 -07:00
07e581d639 Remove useless name check for inputs (#4618)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4618

`onnxInputNames_` originated from positional name binding. This is inherited from C2, where in C2 inputs are bound by position. So it's useless to check the name here as like as `onnxInputNames_` is filled. If should save cycles on string comparison.

Test Plan: run it.

Reviewed By: jackm321

Differential Revision: D22104338

fbshipit-source-id: 250463744aa37ed291aebd337e26d573048583ff
2020-06-19 10:05:26 -07:00
96057c0080 Fix missing deprecation warning for Tensor.nonzero(). (#40187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40187

There were two issues:
1) The hand-written definition included an ambiguous default, which made the deprecated signature not selected.  This didn't match the handwritten torch.nonzero, now they do.
2) A parsing bug for empty argument lists meant the signature wasn't being marked as deprecated.

Test Plan: Imported from OSS

Differential Revision: D22118236

Pulled By: gchanan

fbshipit-source-id: a433ce9069fef28aea97cbd76f2adf5a285abd73
2020-06-19 09:24:48 -07:00
ece8ef2fc6 Run canonical graph optimizations in optimize_for_mobile. (#38840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38840

JIT graph executor runs some canonical optimizations such as cse, dead code
elimination etc before constructing code that interpreter executes.
Since we do not have full JIT in lite interpreter any such graph optimizations
must happen AOT.
This diff applies such canonical optimizations on graph.

Test Plan: CI's test_mobile_optimizer.

Reviewed By: dreiss

Differential Revision: D21675855

fbshipit-source-id: 5dd898088ef8250103ccbbb6aa2bbce156a8d61d
2020-06-19 09:19:29 -07:00
a11870b45d Revert D22118971: [android] gradle version update
Test Plan: revert-hammer

Differential Revision:
D22118971 (262ad8e6ab)

Original commit changeset: 566e45e8f6f7

fbshipit-source-id: 74cfec0c978b724d84460a6d0c98f97b389811f7
2020-06-19 08:48:21 -07:00
b0324a97f5 _jit_pass_fold_convbn wrapped with fuse_conv_bn_script (#40224)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40224

Test Plan: Imported from OSS

Differential Revision: D22117111

Pulled By: edmundw314

fbshipit-source-id: 9252674bd770ba6669d50090849d9f9bc13edaa3
2020-06-19 08:19:40 -07:00
b7bfdcbe3e [caffe2/torch] Use logger in jit instantiator
Summary:
Previously the module would log some data using `print()`. This can be
a problem when used in contexts where the process expects to write data to
stdout itself. This diff changes the log statements to use `logger` instead.
This makes it similar to other log statements in the same module.

Test Plan:
Confirmed no weird test showed up when running:

buck test caffe2/test/distributed/nn/api:remote_module_fork

Differential Revision: D22136172

fbshipit-source-id: a3d144eba6c75925ed684981793c84b36eb45a5d
2020-06-19 07:49:15 -07:00
2393bab036 [TensorPipe] Update documentation (#40222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40222

Mention the TensorPipe agent in the RPC docs and give users the information they need to choose which agent to use.
ghstack-source-id: 106225711

Test Plan: Export to GitHub, build locally and try out the docs.

Differential Revision: D22116494

fbshipit-source-id: 30703ba8410c40f64e785f60d71dfd9faa8de4a1
2020-06-19 04:26:49 -07:00
8315bb2359 Back out "[pytorch][PR] [JIT] Infer NamedTuple type attributes of nn.Modules correctly" (#40270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40270

Original commit changeset: 1227e243ab94

D22082806 (1e03d603c6) broke the model generation of pyper models. We trace the namedtuple as input. To unblock the development of PyPer project, let's revert the diff first.

Sorry about the inconvenience, SplitInfinity
ghstack-source-id: 106217609

Test Plan: buck run dper3/dper3_models/experimental/pytorch/feed:feed_generation_script -- --model_files_dir=/tmp/

Reviewed By: alyssawangqq

Differential Revision: D22132960

fbshipit-source-id: ce9278c8462602a341e231ea890e46f74e743ddf
2020-06-19 02:58:31 -07:00
86b1afa039 Assert that kernels are called with the right signature (#40251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40251

Rather than segfaulting, we should show a good error message when in op.call<Return, Args...>(...) the Return type or Args types mismatch the kernel.

This adds an assertion comparing two std::type_index to the call path, but that should be fast. Hashing the function signature is also in the call path and not strictly constexpr, but I checked on godbolt that GCC >=5 and Clang >=3.8 optimize it away and make it constexpr, i.e. it's not part of the assembly.
ghstack-source-id: 106194240

Test Plan: waitforsandcastle

Differential Revision: D22126701

fbshipit-source-id: 6c908a822e295757bcc0014f78f51e6a560f221f
2020-06-18 21:54:05 -07:00
02e091902f Release DistAutogradContainer context for each dist_autograd test case (#38711)
Summary:
this fixes - https://github.com/pytorch/pytorch/issues/38710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38711

Differential Revision: D22132057

fbshipit-source-id: 894280d164543c63beaec679c18f2059e7055b01
2020-06-18 20:58:55 -07:00
6e2c88980e .circleci: Add git to the ecr gc docker images (#40262)
Summary:
Fixes GC jobs failing

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40262

Differential Revision: D22131192

Pulled By: seemethere

fbshipit-source-id: 182eb5f2f49889e6ef19817130e155c52bec2060
2020-06-18 20:37:35 -07:00
ccea3726da [Reland #4] Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h (#40211)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39471

Reland of https://github.com/pytorch/pytorch/issues/39612 #39881 https://github.com/pytorch/pytorch/issues/40045 #40122

Proof: [green TBB test](https://app.circleci.com/pipelines/github/pytorch/pytorch/182769/workflows/ae9f4f7a-791a-49df-9625-e2f0a51e70e7/jobs/5910591/steps)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40211

Reviewed By: malfet

Differential Revision: D22128537

Pulled By: pbelevich

fbshipit-source-id: 98c589405daafc2c81f76e1d5c1aef5e57065351
2020-06-18 20:19:55 -07:00
a6420b8c75 Increase bazel test timeout to 8 minutes (#40263)
Summary:
Intermittently `integration_test` hits default 5 min timeout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40263

Differential Revision: D22131334

Pulled By: malfet

fbshipit-source-id: 128d3b6882ac5c1b60229a8e0cd2752b817191b5
2020-06-18 20:07:59 -07:00
8f51c39649 Improve torch.futures docs (#40245)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40245

Test Plan: Imported from OSS

Differential Revision: D22126892

Pulled By: mrshenli

fbshipit-source-id: e7d06b9b20ac8473cc6f0572dd4872096fd366c3
2020-06-18 18:47:25 -07:00
13bd5992d0 Remove finalize_bucket_sparse from DDP (#40130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40130

The sparse gradients for the model and the tensor that is used to
perform allreduce in DDP are essentially the same and have the same storage. As
a result, once allreduce is done, the sparse gradients are automatically
updated and unlike dense gradients we don't need to assign the bucket's
contents back to the grad.

In addition to this, I've also added a test for distributed autograd to ensure
it works correctly for sparse gradients. I discovered `finalize_bucket_sparse`
was redundant as part of this test since it passed without any changes needed
to `finalize_bucket_sparse` which only looked at the `.grad` field.
ghstack-source-id: 106090063

Test Plan: waitforbuildbot

Differential Revision: D22080004

fbshipit-source-id: 493ce48b673f26b55dffd6894a3915dc769839f6
2020-06-18 17:07:45 -07:00
7e82382ad5 Allow profiler to be enabled remotely with RPC (#38748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748

This diff contains the message scaffolding and profiler changes in order to be able to remotely run the profiler across different nodes and aggregate the results on a single node.

As discussed, we have implemented this by creating new message types, that similar to autograd messages, wrap the profiling information with the original message, and send this new message over the wire. On the receiving end, this wrapped message is detected, we fetch the original message from it, and process the original message with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Events` and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler).

Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which as discussed with ilia-cher , will be used along with the `Event`s handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables), otherwise, it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
ghstack-source-id: 106134626

Test Plan: Added unittests, existing profiler unittests.

Differential Revision: D19510010

fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
2020-06-18 17:01:57 -07:00
d58b8222b7 [JIT] Add support for with statements (#34705)
Summary:
**Summary**
This commit adds support for with statements to PyTorch JIT. Each
of the with items in a with statement is represented in the JIT IR
as a pair of `prim::Enter` and `prim::Exit` nodes that call the
`__enter__` and `__exit__` methods defined on the context manager objects
returned by the expressions in the with item.

**Testing**
This commit adds unit tests for with statements with named with items,
nameless with items, and with statements that encounter exceptions.
```
$ python test/test_jit.py TestWith.test_with_as
Fail to import hypothesis in common_utils, tests are not derandomized
.
----------------------------------------------------------------------
Ran 1 test in 0.430s

OK
```

```
$ python test/test_jit.py TestWith.test_with_no_as
Fail to import hypothesis in common_utils, tests are not derandomized
.
----------------------------------------------------------------------
Ran 1 test in 0.264s

OK
```

```
$ python test/test_jit.py TestWith.test_with_exceptions
Fail to import hypothesis in common_utils, tests are not derandomized
Couldn't download test skip set, leaving all tests enabled...
.
----------------------------------------------------------------------
Ran 1 test in 1.053s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34705

Differential Revision: D22095945

Pulled By: SplitInfinity

fbshipit-source-id: f661565a834786725259b8ea014b4d7532f9419d
2020-06-18 16:57:18 -07:00
8c73e74fdf Clean up thrust::complex usage in geometric kernels (#39293)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39293

Differential Revision: D22123705

Pulled By: anjali411

fbshipit-source-id: cb83a9c93d1d5e9d499e52ecec61f3dc025f430c
2020-06-18 16:52:32 -07:00
9788a74da8 [quant][bug] Fix histogram observer with 0 input (#40191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40191

When the first couple of inputs passed to histogram observer are all 0's subsequent non-zero inputs cause a div by 0 error

Test Plan:
python test/test_quantization.py TestHistogramObserver.test_histogram_observer_zero_inputs

Imported from OSS

Differential Revision: D22119422

fbshipit-source-id: 8bbbba914ba7f343121830c768ca0444439f8e03
2020-06-18 16:33:50 -07:00
262ad8e6ab [android] gradle version update (#40176)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40176

Test Plan: Imported from OSS

Differential Revision: D22118971

Pulled By: IvanKobzarev

fbshipit-source-id: 566e45e8f6f7aa357c98976ad9981c76d4c66a7f
2020-06-18 16:28:34 -07:00
52a2adb3f4 [android] test_app example linking to pytorch_android aar content (#39587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39587

Example of using direct linking to pytorch_jni library from aar and updating android/README.md with the tutorial how to do it.

Adding `nativeBuild` dimension to `test_app`, using direct aar dependencies, as headers packaging is not landed yet, excluding `nativeBuild` from building by default for CI.

Additional change to `scripts/build_pytorch_android.sh`:

Skipping clean task here as android gradle plugin 3.3.2 exteralNativeBuild has problems with it when abiFilters are specified.

Will be returned back in the following diffs with upgrading of gradle and android gradle plugin versions.

Test Plan: Imported from OSS

Differential Revision: D22118945

Pulled By: IvanKobzarev

fbshipit-source-id: 31c54b49b1f262cbe5f540461d3406f74851db6c
2020-06-18 16:26:25 -07:00
954a59a2f5 Add at::tensor(complex) and torch::tensor(complex) overload (#39793)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39793

Differential Revision: D22067181

Pulled By: anjali411

fbshipit-source-id: 3cec1289a8aa3a9cc6bd1fcdb2974f858f75f7bd
2020-06-18 16:20:27 -07:00
35f357927d [futures] Add specific python unittest coverage for collect_all/wait_all (#40233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40233

There was a question earlier about whether torch.futures.wait_all() would
raise if the underlying futures raise (it was supposed to, but there was no
test coverage). This change adds a couple of very basic
torch.futures.collect_all/wait_all tests.
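
A minimal sketch of the surface under test, assuming the documented torch.futures API:

```
import torch

f1, f2 = torch.futures.Future(), torch.futures.Future()
fut_all = torch.futures.collect_all([f1, f2])   # completes once both futures do
f1.set_result(1)
f2.set_result(2)
print([f.wait() for f in fut_all.wait()])       # [1, 2]
print(torch.futures.wait_all([f1, f2]))         # [1, 2]; re-raises if any future raised
```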
ghstack-source-id: 106168134

Test Plan: buck test mode/dev-nosan caffe2/test:futures

Differential Revision: D22120284

fbshipit-source-id: 3a8edae5dbf8c58c8361eff156c386a684ec5e86
2020-06-18 16:14:10 -07:00
8b5732e8ad Move torch.cuda annotations inline (#40075)
Summary:
Also enable `torch.cuda` typechecking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075

Differential Revision: D22121275

Pulled By: malfet

fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325
2020-06-18 15:52:29 -07:00
c1958de49d [Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D22112813

fbshipit-source-id: 18ec732d7fc9752d5ed84e1cbb1e455e39e65d1e
2020-06-18 15:36:44 -07:00
41f2dbde31 Add AdamW to C++ frontend (#40009)
Summary:
Slightly modified Adam, following the Python implementation; the `ProducesPyTorchValues` tests pass. I had a problem with another test, though (see commit c1a6241676ab84fc531c1c3a10f964aa5704092e): it seems that optimizing for two steps with the same optimizer vs. optimizing for two steps using freshly initialized objects will produce the same output.
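
Since this follows the Python implementation, here is a minimal sketch of the equivalent torch.optim.AdamW usage that the C++ class mirrors (hyperparameters are illustrative):

```
import torch

model = torch.nn.Linear(10, 1)
# AdamW applies weight decay decoupled from the gradient-based update.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss = model(torch.randn(8, 10)).sum()
loss.backward()
opt.step()
```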
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40009

Differential Revision: D22096053

Pulled By: glaringlee

fbshipit-source-id: a31a8f5488cb37c53752ddf15436efabdba67dc4
2020-06-18 15:28:12 -07:00
89ef8f8141 add test_openmp to ROCM_BLACKLIST (#40204)
Summary:
This test is flaky on the ROCm platform. Add it to the blacklist until it can be further reviewed.

CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40204

Differential Revision: D22108295

Pulled By: xw285cornell

fbshipit-source-id: 802444a7b41260edcb6ce393237784f3e6c52a74
2020-06-18 15:15:35 -07:00
430d5cec0e print position of the operator that failed to onnxifi (#40232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40232

If an operator failed to onnxifi due to lack of support (not because of missing shapes), print out the position of that op, which can be used to feed net runner.

Test Plan: I0618 09:25:06.299002 1570804 onnxifi_transformer.cc:1232] Don't support c2 op SparseLengthsSumFused4BitRowwise at pos 246 (1030)

Reviewed By: hl475

Differential Revision: D22120055

fbshipit-source-id: a3c68b93b7e38dfda5d70168e7541021a8e16dcb
2020-06-18 15:08:39 -07:00
cb8b2f0636 Revert D21534052: Assert that kernels are called with the right signature
Test Plan: revert-hammer

Differential Revision:
D21534052

Original commit changeset: 6be436a3f205

fbshipit-source-id: a149c5ca7f9e78947ae3059ac4470712f291660b
2020-06-18 15:00:13 -07:00
85128113f9 [Selective build] Enable selective build in VariablType
Summary:
Quick fix due to code merging. With this feature working, the total size reduction on Android is 664 KB (PyTorch: 26 KB, papaya: 639 KB).
https://fburl.com/unigraph/c726gvb1

Test Plan: CI

Reviewed By: kwanmacher

Differential Revision: D22053779

fbshipit-source-id: 8da4a651432b453c25e543bc64dbed02946de63d
2020-06-18 14:31:09 -07:00
55cdd31bd0 Assert that kernels are called with the right signature (#38361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38361

Rather than segfaulting, we should show a good error message when, in op.call<Return, Args...>(...), the Return type or Args types mismatch the kernel.

This adds to the call path an assertion comparing two std::type_index values, but that should be fast. Hashing the function signature is also in the call path and not strictly constexpr, but I checked on godbolt that GCC >=5 and Clang >=3.8 optimize it away and make it constexpr, i.e. it's not part of the assembly.

supersedes D17485438

ghstack-source-id: 106178820

Test Plan: waitforsandcastle

Differential Revision: D21534052

fbshipit-source-id: 6be436a3f20586277a051d764af29e21d5567da0
2020-06-18 14:22:48 -07:00
d14d47b9b5 Get rid of global constructors in cuda codegen (#40183)
Summary:
Use a switch statement instead of lookups in global std::unordered_map<> instances to do enum-to-name conversions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40183

Reviewed By: malfet

Differential Revision: D22117731

Pulled By: ionsphere

fbshipit-source-id: d150114cfae5b1222bb9142d815f2379072506c7
2020-06-18 13:54:11 -07:00
0891764e80 [android] ANDROID_STL=c++_shared (#39588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39588

Before this diff we used c++_static linking.
Users will dynamically link to libpytorch_jni.so and have at least one more shared library of their own that probably uses the STL.

We must have no more than one STL per app. ( https://developer.android.com/ndk/guides/cpp-support#one_stl_per_app )

To have only one STL per app, this changes ANDROID_STL to c++_shared, which adds libc++_shared.so to the packaging.

Test Plan: Imported from OSS

Differential Revision: D22118031

Pulled By: IvanKobzarev

fbshipit-source-id: ea1e5085ae207a2f42d1fa9f6ab8ed0a21768e96
2020-06-18 13:50:05 -07:00
d3b786afdb [android] Add libtorch headers to pytorch_android aar (#39507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39507

Adds a gradle task that runs after `assemble` to add a `headers` folder to the aar.

Headers are chosen for the first specified ABI; they should be the same for all ABIs.

Adding headers works by temporarily unpacking the aar into the gradle `$buildDir`, copying the headers into it, and re-zipping the aar with the headers included.

Test Plan: Imported from OSS

Differential Revision: D22118009

Pulled By: IvanKobzarev

fbshipit-source-id: 52e5b1e779eb42d977c67dba79e278f1922b8483
2020-06-18 13:47:18 -07:00
83d7718c5c .circleci: Add docker builds based on rev-parse (#40194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40194

Adds the scaffolding for doing docker builds based off git rev-parse
tags to detect changes.

Basically allows us to do our previous builds while also prepping for
the new builds by just retagging our current builds as the new ones and
telling the garbage collector not to reap them.

Should also skip out on redundant builds if the image already exists
thus saving us some compute time on docker builds.

Also adds the commands to load the calculated DOCKER_TAG from a shared
workspace file.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22120651

Pulled By: seemethere

fbshipit-source-id: c74f10816d63f440a9e0cdd00d6fa1a25eb7a2c1
2020-06-18 13:41:17 -07:00
442ec1dd4e [test] split remaining quantization tests out of test_jit (#40144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40144

As titled: split the remaining quantization tests out of test_jit to reduce
its size.

Test Plan: Imported from OSS

Differential Revision: D22085034

Pulled By: wanchaol

fbshipit-source-id: 0c8639da01ffc3e6a72e6f470837786c73a6b3f0
2020-06-18 13:39:13 -07:00
30648985a7 Revert D22108899: Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier()
Test Plan: revert-hammer

Differential Revision:
D22108899

Original commit changeset: 6b109ef9357e

fbshipit-source-id: 41ca36091a7d4d5e94143835809560362fb14fcd
2020-06-18 13:35:10 -07:00
74a2cb87e3 [android][cmake] Remove NO_EXPORT for libtorch mobile build (#39584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39584

Removing `-DNO_EXPORT` for the non-custom build to make it possible to link to the C10/A10 API.
The custom build stays the same, as its main goal is minimum binary size, while exported API functions would increase it.

Additional changes:

1. aten/src/ATen/DynamicLibrary.cpp uses libdl; if we need this functionality we will need to link the result with libdl, but for now it is disabled for mobile.

Test Plan: Imported from OSS

Differential Revision: D22111600

Pulled By: IvanKobzarev

fbshipit-source-id: d730201c55f543c959a596b34be532aecee6b9ab
2020-06-18 11:48:53 -07:00
034eddca01 Fix typos in RPC Docs (#40219)
Summary:
The environment variables MASTER_ADDRESS and MASTER_port should be MASTER_ADDR and MASTER_PORT, respectively.
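
For reference, a minimal sketch of setting the corrected variables (values are illustrative):

```
import os

# Variable names expected by init_method="env://"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
```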
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40219

Differential Revision: D22116585

Pulled By: mrshenli

fbshipit-source-id: d312ae66210b0a16ec3ab1f468b1654bb0a75a0f
2020-06-18 11:40:11 -07:00
645d6c014c preserve output tensor's stride in TI's fast setup (#38895)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38895

Test Plan: Imported from OSS

Differential Revision: D21696586

Pulled By: glaringlee

fbshipit-source-id: c7206dbcf74d30998544e221cd0c998c4c25663a
2020-06-18 11:34:21 -07:00
6a42d85fc6 .circleci: Move docker_build workflow to codegen (#40189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40189

This is to allow for easier modification later on down the road.

Makes no actual modification to the `.circleci/config.yml`

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22119414

Pulled By: seemethere

fbshipit-source-id: c6cb105d983e43ae1bf289b2d9f734b34a7febe2
2020-06-18 11:19:29 -07:00
aa84ec5325 [quant][api] Expose graph mode quantization API in torch.quantization (#40198)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40198

Test Plan: Imported from OSS

Differential Revision: D22106542

fbshipit-source-id: 482af0194b8d084dfc76426447e58b86efaa1a59
2020-06-18 10:34:20 -07:00
fef253e711 [codemod][custom_rule] Migrate some scripts to use named outputs for custom_rule
Reviewed By: jermenkoo

Differential Revision: D22069459

fbshipit-source-id: d1e1ed43080f29cfac55af8ff3af571efd10b9de
2020-06-18 10:29:54 -07:00
fcc9a1e664 graph mode: move hardsigmoid op to single_input_general_value category (#40055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40055

Noticed this while reading the `helper.cpp` file; it seems like this op
should be in the `single_input_general_value` bucket.

Test Plan:
CI

Imported from OSS

Differential Revision: D22054257

fbshipit-source-id: 2ca16ff863d644cbd03c3938eeca0fb87e3e4638
2020-06-18 10:21:22 -07:00
37362fff66 graph mode: util for fusion of functions which require observation (#39413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39413

Implementing the request from
https://github.com/pytorch/pytorch/pull/39095

WIP so we can align on the API; once it looks good,
I will amend the PR to apply it to all relevant functions.

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_hardswish
```

Imported from OSS

Differential Revision: D21885263

fbshipit-source-id: 029339a99f8c50e45dd1dfb7fd89c20e3188720d
2020-06-18 10:21:20 -07:00
4ad8ebe738 quant layer/group/instance norm: make weights and biases optional (#39203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39203

Adds logic and test coverage for optional weights and biases for
the quantized normalization operators. This was broken before this
PR because the `TORCH_LIBRARY` registration had these as required parameters;
this PR removes that requirement and cleans up the call sites.

Note: consolidating the registrations in `native_functions.yaml` as opposed to `library.cpp`
after a discussion with ezyang.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qlayer_norm
python test/test_quantization.py TestQuantizedOps.test_group_norm
python test/test_quantization.py TestQuantizedOps.test_instance_norm
python test/test_quantization.py TestStaticQuantizedModule.test_layer_norm
python test/test_quantization.py TestStaticQuantizedModule.test_group_norm
python test/test_quantization.py TestStaticQuantizedModule.test_instance_norm
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_layer_norm
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_group_norm
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_instance_norm
```

Imported from OSS

Differential Revision: D21885259

fbshipit-source-id: 978c7b8bd6c11a03e9e5fdb68f154cb80cc43599
2020-06-18 10:19:39 -07:00
d4e4f13173 [quant][graphmode] Add support for detach (#40197)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40197

Test Plan: Imported from OSS

Differential Revision: D22106544

fbshipit-source-id: 047236bf8c7cb0813563e6c7bcd41b79dfa6fb2b
2020-06-18 10:01:43 -07:00
5f309505ce Move the check on orig_weight sizes. (#40200)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40200

Since original weights are removed by default in mobile build, the check must
be moved to a place where orig_weight is still valid.

Test Plan:
CI
Also, a model crash observed at run time was resolved by this change.

Reviewed By: supriyar

Differential Revision: D22101562

fbshipit-source-id: 9543e69a415beaef2a9fb92dc9cd87d636174d51
2020-06-18 08:34:13 -07:00
f3f30d4354 [JIT x RPC] Consolidate RRef type class and RRef impl class (#35694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35694

close https://github.com/pytorch/pytorch/issues/35110

Differential Revision: D7881729

fbshipit-source-id: eedda8f1b7510491886d469efeed4e002bb8b991
2020-06-18 07:46:38 -07:00
7c9e78fdf5 [TensorPipe] Add options for agent, including backend killswitches (#40162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40162

The only public option is `num_worker_threads`. The other ones are private (as indicated by the leading underscore; is that enough?) and allow specifying a different set and order of transports/channels. These can thus be used to disable a backend (by not specifying it) or to force one (by raising its priority). They can therefore be used to work around defective backends, in case we find any post-release.
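
A minimal sketch of passing the public option at init time (worker name, sizes, and thread count are illustrative assumptions):

```
import torch.distributed.rpc as rpc

# Assumes MASTER_ADDR/MASTER_PORT are already set in the environment.
opts = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
rpc.init_rpc(
    "worker0", rank=0, world_size=1,
    backend=rpc.BackendType.TENSORPIPE,
    rpc_backend_options=opts,
)
rpc.shutdown()
```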
ghstack-source-id: 106103238

Test Plan: Built //caffe2:ifbpy and, using TensorPipe's verbose logging, verified that the transports/channels I specified were indeed the ones that were being registered.

Differential Revision: D22090661

fbshipit-source-id: 789bbe3bde4444cfa20c40276246e4ab67c50cd0
2020-06-18 02:54:17 -07:00
d1a0e88075 Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40207

Blocking wait didn't work for dist.barrier() since we performed a
cudaDeviceSynchronize() before we performed any of the timeout checks. As a
result, in case of failures/desync the barrier() call would get stuck on
cudaDeviceSynchronize() and would never return a timeout error to the user.

To fix this, I've moved the device synchronization after the timeout checks.
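
A minimal sketch of the scenario this enables, assuming a CUDA/NCCL setup and env:// initialization variables already set:

```
import os
from datetime import timedelta
import torch.distributed as dist

# Must be set before the process group is created.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
dist.init_process_group("nccl", timeout=timedelta(seconds=30))
dist.barrier()  # a desynced/absent peer now surfaces as a timeout error instead of a hang
```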
ghstack-source-id: 106123004

Test Plan: waitforbuildbot

Differential Revision: D22108899

fbshipit-source-id: 6b109ef9357e9464e7d66b540caabf5801e6a44a
2020-06-17 23:44:59 -07:00
4553b0b537 Reduce number of Window test configurations (#38482)
Summary:
After this diff, the following build configurations will be compiled on PRs:
     - VS2017 14.11, CUDA10.1
     - VS2017 no CUDA, CUDA10.1
     - VS2019, CUDA10.1
    And the following will be tested:
     - VS2017 14.11, CUDA10.1
     - VS2017 14.11 no CUDA (only 1st half of tests)
     - VS2017 14.11 forced on CPU (only 1st half of tests)

    And on master, we would build both VS2017 14.11 and 14.16, but test only the VS2017 14.11 and VS2019 builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38482

Differential Revision: D22111743

Pulled By: malfet

fbshipit-source-id: d660e4bc8f4f17a93f1cc18402cd5f2091b7789d
2020-06-17 23:40:31 -07:00
fd7e09a52b [quant][graphmode] Clean up and add more logging (#40196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40196

- separated the passes in insert observers to make it more robust
- added a print for the quantization type
- added more logging for insert observers

Test Plan: Imported from OSS

Differential Revision: D22106545

fbshipit-source-id: 6d8d722e33c1259b1a6a501853c801c275dbfcff
2020-06-17 23:35:28 -07:00
76fbfba644 Move _dummy_type to _utils.py (#40177)
Summary:
Use it from both __init__ and streams to define dummy types when CUDA is missing
Fix accidental reference of global `storage_name` from `_dummy_type`
Add type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40177

Differential Revision: D22106922

Pulled By: malfet

fbshipit-source-id: 52fbfd91d70a78eb14d7ffda109c02ad1231497e
2020-06-17 22:50:02 -07:00
efd9fc7434 Remove thrust::complex from sqrt (#39901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39901

Differential Revision: D22109238

Pulled By: pbelevich

fbshipit-source-id: 96a72bd0df391b872f8e6d08fe7b5dca61b472ab
2020-06-17 20:39:14 -07:00
edd3fbc61e Add aarch64 specific quantize_tensor using arm intrinsics. (#40113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40113

The earlier version covered only armv7, aka aarch32. This diff adds the aarch64
implementation as well.
ghstack-source-id: 105990688

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D22072779

fbshipit-source-id: c01f0b3f84394710339cf3b791832fcf68fcd4c0
2020-06-17 19:54:11 -07:00
fb02007e9f Export box_cox operator in caffe2
Summary: Export box_cox operator in caffe2

Test Plan: Pass all unit tests

Reviewed By: mingzhe09088

Differential Revision: D21515797

fbshipit-source-id: 777ee5e273caeab671ee2c22d133d3f628fb4a6e
2020-06-17 19:28:53 -07:00
1800032712 [quant][graphmode] Add warning for prim::Loop (#40195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40195

Test Plan: Imported from OSS

Differential Revision: D22106543

fbshipit-source-id: c0958cd2f977e8bdbbb1ac1befe51326b4619f94
2020-06-17 18:58:45 -07:00
0b3755b1d0 Add optimization blacklist as second arg to optimizeForMobile method. (#37462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37462

Instead of running all the optimization passes in the optimizeForMobile method,
this introduces a whitelist optimizer dictionary as a second parameter to the
method. When it is not passed, the method runs all the optimization passes;
otherwise, the method reads the dict and only runs the passes whose value is
True.
ghstack-source-id: 106104503

Test Plan:
python test/test_mobile_optimizer.py

Imported from OSS

Differential Revision: D22096029

fbshipit-source-id: daa9370c0510930f4c032328b225df0bcf97880f
2020-06-17 18:14:45 -07:00
30364f0b01 Remove obsolete warning message from DDP (#40190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40190

Fixed by #36503

Test Plan: Imported from OSS

Differential Revision: D22101516

Pulled By: mrshenli

fbshipit-source-id: 9abd6dce602530c11b7fe623ac0f4d556dccc961
2020-06-17 17:58:21 -07:00
74142f76fa Adding torch.futures to API docs (#40051)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40051

Test Plan: Imported from OSS

Differential Revision: D22055031

Pulled By: mrshenli

fbshipit-source-id: ce8a79ba4ffdc7dbed6d4c62b1c33b96764c89e7
2020-06-17 17:55:48 -07:00
693ab77c00 [test] split onnx export test out of test_jit (#40143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40143

As titled, to reduce the size of test_jit.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22085036

Pulled By: wanchaol

fbshipit-source-id: 424f189fd3849c111d06ebe2e341da50d98fe0ec
2020-06-17 17:27:50 -07:00
27d789500b [test] split tracer related tests out of test_jit (#40142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40142

test_jit is becoming huge again, which makes it hard for editors to load and
for us to write new tests; this splits out the tracer-related tests.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22085035

Pulled By: wanchaol

fbshipit-source-id: 696bee84985ecfbfeac8e2ee5c27f1bdda8de394
2020-06-17 17:26:33 -07:00
e34e32850e Revert D22076711: [Reland #3] Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h
Test Plan: revert-hammer

Differential Revision:
D22076711

Original commit changeset: fa7b6335ebb5

fbshipit-source-id: 254b6941482f855e81c666e786fc5a4a1b57864f
2020-06-17 16:49:16 -07:00
a2ef54c598 [pytorch] fix CUDA_KERNEL_ASSERT macro for android build (#40151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40151

For debug android build it throws the following error:
```
  In file included from src/pytorch/android/pytorch_android/src/main/cpp/pytorch_jni_common.cpp:9:
  In file included from src/pytorch/android/pytorch_android/src/main/cpp/pytorch_jni_common.h:2:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/torch/csrc/api/include/torch/types.h:3:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/ATen.h:5:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/Context.h:4:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/Tensor.h:3:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/ATen/core/TensorBody.h:7:
  In file included from ../../../../src/main/cpp/libtorch_include/armeabi-v7a/c10/core/Scalar.h:13:
  ../../../../src/main/cpp/libtorch_include/armeabi-v7a/c10/util/TypeCast.h:157:22: error: use of undeclared identifier '__assert_fail'
  AT_FORALL_QINT_TYPES(DEFINE_UNCASTABLE)
                       ^
```

It seems __assert_fail() isn't available on Android by default; in NDEBUG mode it forward-declares the function and CI passes.

But CUDA_KERNEL_ASSERT() shouldn't be relevant for the mobile build at all, and we already bypass `__APPLE__`, so the easiest fix is to add `__ANDROID__`.

Test Plan: Imported from OSS

Differential Revision: D22095562

Pulled By: ljk53

fbshipit-source-id: 793108a7bc64db161a0747761c0fbd70262e7d5a
2020-06-17 16:26:08 -07:00
3ea15af630 [Onnxifi] Allow adding timeout for OnnxifOp run (#40081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40081

Adding the functionality to enable timeout of OnnxifiOp run. In the case of backend hanging, it can error out quickly.

Test Plan:
```
 buck test glow/fb/test:test_onnxifinnpi -- test_timeout
```

Reviewed By: jackm321

Differential Revision: D22064533

fbshipit-source-id: 25487287c10ab217eb95692f09d48e13e19436ab
2020-06-17 16:21:25 -07:00
1670ea9474 Remove overload of GPU max_pool3d with kernel_width; fix nan, inf in GPU {fractional,adaptive} max_pool{2,3}d (#39903)
Summary:
Fix https://github.com/pytorch/pytorch/issues/39846.
Fix https://github.com/pytorch/pytorch/issues/39044

The problem was that `max_pool3d_with_indices_single_out_frame` has an overload in which kernel_width is a template argument. The two overloaded kernels were supposed to be identical; however, they were not.

The general version
da3073e9b1/aten/src/ATen/native/cuda/DilatedMaxPool3d.cu (L69-L73)

The overloaded version
da3073e9b1/aten/src/ATen/native/cuda/DilatedMaxPool3d.cu (L130-L134)

When max_pool3d is "switch-case"-ed to the overloaded version, the NaN value comparison is ignored. Also, maintaining two overloaded versions of such a complicated kernel would be hard. I'm not sure the overloaded version even gives a large performance benefit, so I propose to remove the kernel_width-overloaded version.

Also, the current test of max_pool_XD_nan was missing the device kwarg. I added that.
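
A minimal check of the fixed behavior, assuming a CUDA device is available:

```
import torch
import torch.nn.functional as F

# NaNs must propagate through GPU max pooling (the comparison the overload skipped).
x = torch.full((1, 1, 4, 4, 4), float("nan"), device="cuda")
y = F.max_pool3d(x, kernel_size=2)
assert torch.isnan(y).all()
```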

Edit: profiling before and after
script: https://github.com/xwang233/code-snippet/blob/master/maxpool-3d-kw-template-arg/a.py
plot: https://github.com/xwang233/code-snippet/blob/master/maxpool-3d-kw-template-arg/b.ipynb

The performance difference is within +- 5%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39903

Differential Revision: D22080759

Pulled By: ngimel

fbshipit-source-id: 4dacdd266a0522b3ff432eb9d58b131fa86821e9
2020-06-17 16:18:33 -07:00
7f0e4265ac ROCm thunk work-around for future transition to ROCm 3.5.1 (#40181)
Summary:
ROCm CI hosts will have their kernels upgraded first to ROCm 3.5.1.  CI images will follow soon after.  Due to the thunk/kernel mismatch during the interim, this PR will detect the mismatch and upgrade the thunk during the build.  This PR will be reverted once migration to ROCm 3.5.1 images is complete.

CC ezyang xw285cornell
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40181

Differential Revision: D22104488

Pulled By: xw285cornell

fbshipit-source-id: 7192e1d0bb25bfb814e9a85efb4aa29d0e52b460
2020-06-17 16:17:06 -07:00
f4ffe99da5 Fix flaky rref timeout test (#40141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40141

This rref timeout test could be flaky because we could end up processing `RRefUserDelete` messages on the owner node before processing the to_here message. This would result in a hang in `ProcessGroupAgent::sync()` that eventually results in a timeout.

The rough sequence of what happens is:
0) Node 0 creates RRef on node 1 with rpc.remote() call
1) rref.to_here() is called with a timeout. Because of delay injection, the processing of this message can be delayed (this is also technically possible in applications without delay injection)
2) At some point, the callbacks corresponding to rpc.remote() run and confirm the rref, adding it as a confirmed user
3) RPC shutdown starts, as part of which we send out RRef user deletes. In this case, 0 sends an RRef user delete to 1, and node 1 removes the owner from the `owners_` field.
4) The `to_here()` message is finally processed by node 1. But since we have deleted the `owner_`, while processing this message we create a future that will complete when the owner exists (this is to account for the case of to_here() arriving before rpc.remote). But this future will never complete, since the owner is already deleted, so we hang indefinitely

As a workaround for now, we can force `to_here()` to run before RPC shutdown by adding a blocking `to_here()` call with no timeout.

A more robust, longer-term fix would be to detect whether an owner has been previously deleted (such as by an RRefUserDelete). Then, we know that the future corresponding to owner creation on the remote end will never complete, and we can error out when processing a `to_here()`.
ghstack-source-id: 106036796

Differential Revision: D22084735

fbshipit-source-id: fe7265a4fe201c4d6d2f480f64fe085cd59dbfb2
2020-06-17 15:48:38 -07:00
34e28ede57 Fix flaky test (#40175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40175

Check that there is an increasing memory usage in the test

Test Plan: CI

Differential Revision: D22098192

Pulled By: ilia-cher

fbshipit-source-id: bbdbc71f66baf18514332a98d8927441c61ebc16
2020-06-17 15:40:28 -07:00
bc9e8af218 [distributed.nn] Change remote module template instantiator to write to tmp folder (#40173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40173

- Avoid path sharing across runs and workers, so even the test methods/workers run in parallel on the same host, they don't interfere with each other.
- On some environment (e.g. fb internal CI platform), the torch package file tree is not writable. But the temporary folder chosen by Python `tempfile` module is always writable, on linux it's "/tmp".

close https://github.com/pytorch/pytorch/issues/40120

ghstack-source-id: 106086340

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_scripted_remote_module_template

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_non_scripted_remote_module_template
```

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork
```

Differential Revision: D5708493

fbshipit-source-id: dd92695682433aaf79d1912c7956cef40a450eaf
2020-06-17 15:01:30 -07:00
7f88f037ac Stop running target bot on ci-all (#40186)
Summary:
So that ci-all can still be a useful way to get all the build configs that the target specifier can't handle yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40186

Differential Revision: D22100671

Pulled By: ezyang

fbshipit-source-id: df291705e717c0c7e7cf4d675b9d49a1eba54a1d
2020-06-17 14:44:55 -07:00
b5bf21a6bd [JIT] Expose __deepcopy__ on script::Object (#40068)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40068

Test Plan: Imported from OSS

Differential Revision: D22058808

Pulled By: jamesr66a

fbshipit-source-id: d8593b047c553389caea085337305ee893dc6877
2020-06-17 14:02:28 -07:00
1e03d603c6 [JIT] Infer NamedTuple type attributes of nn.Modules correctly (#39116)
Summary:
**Summary**
This commit modifies type inference for `nn.Module` instance attributes
such that the type of a `NamedTuple` attribute is inferred correctly and
such that the field names of this `NamedTuple` instance can be used in
scripted methods. At present, the type of this attribute is inferred to be
`Tuple[T, U, ..., V]`, so the field must be referred to by index and
cannot be referred to by name.
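
A minimal sketch of the pattern this enables (class and attribute names are illustrative):

```
import torch
from typing import NamedTuple

class Point(NamedTuple):
    x: torch.Tensor
    y: torch.Tensor

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.p = Point(torch.ones(1), torch.zeros(1))

    def forward(self) -> torch.Tensor:
        return self.p.x  # field access by name instead of self.p[0]

m = torch.jit.script(M())
print(m())
```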

**Test Plan**
This commit adds a unit test to test that a field of a `NamedTuple`
attribute can be referred to by name in a scripted method.

**Fixes**
This commit fixes https://github.com/pytorch/pytorch/issues/37668.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39116

Differential Revision: D22082806

Pulled By: SplitInfinity

fbshipit-source-id: 1227e243ab941376cd5e382fb093751e88dc8846
2020-06-17 13:58:15 -07:00
c252dddcdd [quant][graphmode] Test JIT tracing for dynamic quant cases (#40128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40128

Reland PR

Test Plan:
python test/test_quantization.py TestQuantizeDynamicScriptJitPasses

Imported from OSS

Differential Revision: D22081258

fbshipit-source-id: a3f7e26ea02ff8946f356afa7203129c6b3d658b
2020-06-17 13:41:56 -07:00
f6739ec8e8 [quant][graphmode] Refactor dynamic quant tests (#40127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40127

Reland PR.
Similar to static quant, break it up into op level tests and tests for jit passes

Test Plan:
python test/test_quantization.py TestQuantizeScriptPTDQOps
python test/test_quantization.py TestDynamicQuantizeScriptJitPasses

Imported from OSS

Differential Revision: D22081259

fbshipit-source-id: cef8f78f89ef8789683b52508379ae1b9ad00700
2020-06-17 13:40:19 -07:00
2ba5f98dd1 Revert D22068657: [pytorch][PR] Remove global CMAKE_INSTALL_RPATH_USE_LINK_PATH directive
Test Plan: revert-hammer

Differential Revision:
D22068657

Original commit changeset: b04c529572a9

fbshipit-source-id: d8227dfc12d9b6382f7bf2905686b6025034561c
2020-06-17 13:05:01 -07:00
55bcb5dccc Fix inconsistent results of string split func on JIT mode (#38772)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/38207

Below is the description of the split function according to the [Python docs](https://docs.python.org/3.8/library/stdtypes.html?highlight=split#str.split).

```
If sep is not specified or is None,  a different splitting algorithm is applied:
runs of consecutive whitespace are regarded as a single separator,
and the result will contain no empty strings at the start or end
if the string has leading or trailing whitespace.
```

The logic to handle both None and empty separators is added in register_string_ops.cpp as a fix.
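
A minimal sketch of the expected behavior after the fix:

```
import torch

@torch.jit.script
def split_ws(s: str):
    # sep=None: runs of consecutive whitespace act as a single separator,
    # with no empty strings at the start or end.
    return s.split()

print(split_ws("  a  b  "))  # ['a', 'b'], matching eager Python
```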

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38772

Differential Revision: D21789612

Pulled By: suo

fbshipit-source-id: 4dfd74eda71e0bfd757378daedc927a4a63ec0e4
2020-06-17 12:43:36 -07:00
5e77999ecb Add global hooks to torch.nn.Module (#38972)
Summary:
This allows registering hooks that will be executed for every module.

This idea arose in a discussion with tkerola, and niboshi kindly proposed this approach.

The use case for this is to avoid boilerplate code when registering the same hook for all the modules in a complex model, the internal use-case was to allow every model to accept a NumPy array in the forward pass in a simpler way. Other use cases involve general mechanisms for plotting or tracing & debugging.

Currently, this is shared for all the modules but this can be worked out to have the hooks shared only per type of module.

If this functionality is not needed feel free to close the PR.
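
A minimal sketch of the intended usage; the registration point is an assumption based on this PR's description of module-global hooks mirroring the per-module register_forward_hook API:

```
import torch

def log_forward(module, inputs, output):
    # Runs for every module's forward pass.
    print(type(module).__name__)

handle = torch.nn.modules.module.register_module_forward_hook(log_forward)
model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU())
model(torch.randn(1, 4))
handle.remove()
```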
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38972

Differential Revision: D22091364

Pulled By: albanD

fbshipit-source-id: 204ff5f9e119eff5bdd9140c64cb5dc467bb23a2
2020-06-17 12:20:35 -07:00
70192c651c [Reland #3] Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h (#40122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40122

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22076711

Pulled By: pbelevich

fbshipit-source-id: fa7b6335ebb5ef2ccf51dc60d9f4079e70f612ba
2020-06-17 12:10:43 -07:00
95e51bb7f8 change BuildExtension.with_options to return a class not a c-tor (#40121)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40121

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22076634

Pulled By: pbelevich

fbshipit-source-id: a89740baf75208065e418d7f972eeb52db9ee3cf
2020-06-17 12:09:09 -07:00
a71aefe857 [android][test_app] cleanup (#40136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40136

Test Plan: Imported from OSS

Differential Revision: D22084170

Pulled By: IvanKobzarev

fbshipit-source-id: f8d2d0494b3ac4f7fe2118238d621155d697d2c4
2020-06-17 11:07:44 -07:00
c958dd5472 [TensorPipe] Add guards against transferring GPU tensors (#40167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40167

In v1.6 TensorPipe will not support transferring GPU tensors so, just like other agents, it should raise the appropriate errors when the user attempts to do so. One such error is when sending the arguments, another is when sending the result.
ghstack-source-id: 106059723

Test Plan: Re-enabled the test for this

Differential Revision: D22091737

fbshipit-source-id: 23dda98bc006333c6179361e8cfaf00ecda06408
2020-06-17 10:51:22 -07:00
08227fea4f Revert D22079377: [pytorch][PR] [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D22079377

Original commit changeset: 9bd2b7e0c34f

fbshipit-source-id: c22cc349d790caa574eace0d63980854c33e5a59
2020-06-17 10:17:27 -07:00
1ec8ece2b9 [RELAND] Change AccumulateGrad to yield .grads that match weights' memory layout (#40129)
Summary:
https://github.com/pytorch/pytorch/pull/34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
2020-06-17 09:02:54 -07:00
5200814cfa Fix test_hook_* issues (#40135)
Summary:
Follows https://github.com/pytorch/pytorch/issues/38972

Some of the changes requested by albanD in the above review are applicable to the regular hooks tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40135

Differential Revision: D22091389

Pulled By: albanD

fbshipit-source-id: e1004213276bfb189167b9870e1a88b3d23b458c
2020-06-17 08:50:42 -07:00
216f512be2 Remove requirement of qnnpack engine for arm build. (#40112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40112

The changed code should run for the arm build even if the qnnpack engine is not
enabled.
Furthermore, given the way the AT_DISPATCH* stubs are defined, they just form a
lambda out of the __VA_ARGS__ and execute the lambda. Thus a return inside such
a lambda just returns to the original function, and we end up executing the
fallback path as well.
Thus this also changes #endif to #else...#endif.
This was causing a perf regression on mobile in one of the models.
ghstack-source-id: 105990691

Test Plan: CI

Reviewed By: supriyar

Differential Revision: D22072780

fbshipit-source-id: b12ca66aa19834b97b3eb0067af4e656cb8b3241
2020-06-17 08:45:55 -07:00
8619d26338 Add batching rule for torch.expand (#40097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40097

This is (probably) necessary for the vmap frontend API (which should come up
right after this PR).

There is some manual handling of sizes in the `expand_batching_rule`.
In particular, when performing expand(Tensor[B0, 3], [2, 3]), where B0
is a batch dimension and Tensor[B0, 3] is a batched tensor with batch
dimension B0, we can't call expand directly on the physical view and
instead first need to perform a view.
It's possible to add said view as a helper function on `VmapPhysicalView` but
after reading through the operator spreadsheet the conclusion was that
no other operator needs the same manual handling.
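
A sketch of the semantics in plain tensor ops: the physical tensor first needs a view that inserts the new dimension, since expand alone cannot add it in the required position:

```
import torch

B0 = 5                              # batch dimension
physical = torch.randn(B0, 3)       # physical storage of Tensor[B0, 3]
# expand(Tensor[B0, 3], [2, 3]) is realized as view + expand on the physical view:
out = physical.view(B0, 1, 3).expand(B0, 2, 3)
print(out.shape)                    # torch.Size([5, 2, 3])
```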

Test Plan: - `./build/bin/vmap_test`

Differential Revision: D22070657

Pulled By: zou3519

fbshipit-source-id: 911854b078a1a5c7d5934ef2e17b16673ed9d103
2020-06-17 08:42:17 -07:00
dec62dbfa3 Change VmapTransforms to use SmallVector instead of vector<int64_t> (#40042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40042

See title. Dynamic allocations are generally bad for performance. This
change was not benchmarked because we have not gotten to the stage where
we want to benchmark performance.

Test Plan: - `./build/bin/vmap_test`

Differential Revision: D22070656

Pulled By: zou3519

fbshipit-source-id: f6cf74a357bb52b75c0a02f1f82495c0a5329a28
2020-06-17 08:42:15 -07:00
161fd5f507 Implement tensor.size(int) for BatchedTensor (#40028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40028

This has tensor.size(int) call native::size directly. Some alternatives I
considered were:
- Call VariableType::size directly. That seems isomorphic to what we're
doing now.
- when creating a BatchedTensor from a regular tensor, put all of the
keys on that tensor into the BatchedTensor's dispatch key set and use
the dispatcher fallthrough mechanism. That seems weird because
BatchedTensor is a tensor wrapper and also error prone because if
BatchedTensor gets the VariableType key, there's a chance that if
something goes wrong, an autogradmeta gets created on it...

Test Plan: - `./build/bin/vmap_test`

Differential Revision: D22070655

Pulled By: zou3519

fbshipit-source-id: 18530579ad41f3c4f96589da41eb24a46caf7af9
2020-06-17 08:40:41 -07:00
dea58a7660 [resubmit] Kill thrust::complex from log kernels (#40079)
Summary:
Use `::log` instead of `std::log` for better ROCm support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40079

Differential Revision: D22068554

Pulled By: pbelevich

fbshipit-source-id: a458ae34535a641832f816617387a45445e2fa48
2020-06-17 05:57:10 -07:00
44c7a2ab69 [TensorPipe] Silence some more harmless warnings (#40094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40094

In #39182 we already silenced a few warnings when they were caused by expected errors, but left one case out, namely errors on an incoming pipe. The idea was to introduce a "proper" way of detecting these, for example by having the remote end send an empty message to indicate an intentional shutdown. I don't know if we'll have time to do that in time for v1.6, so as a temporary solution I'm implementing some approximation which, although imperfect, should cover most errors. I also made the warning message less scary by adding a clarification.
ghstack-source-id: 105969540

Test Plan: Unit tests

Differential Revision: D22067818

fbshipit-source-id: b2e2a37d633f21bca4a2873a05ad92b853dde079
2020-06-17 02:44:50 -07:00
0152baa33a move some math ops back to full jit (#40149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40149

Many math ops were moved to the lite interpreter in D21992552, but some ops (like log) also have a tensor version, and we didn't check for duplicated names in this case. This breaks some existing models.

Move most ops back for now, until we have a cleaner solution.

Test Plan: build

Reviewed By: pengtxiafb

Differential Revision: D22085208

fbshipit-source-id: 951805f43f84bd614cf914c17e00444a122158e4
2020-06-17 02:07:57 -07:00
6de6041585 [iOS] Disable NNPACK on iOS builds (#39868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39868

### Summary

why disable NNPACK on iOS

- To stay consistent with our internal version
- It's currently blocking some external users due to its lack of support for the x86 architecture
    - https://github.com/pytorch/pytorch/issues/32040
    - https://discuss.pytorch.org/t/undefined-symbols-for-architecture-x86-64-for-libtorch-in-swift-unit-test/84552/6
- NNPACK uses fast convolution algorithms (FFT, winograd) to reduce the computational complexity of convolutions with large kernel size. The algorithmic speedup is limited to specific conv params which are unlikely to appear in mobile networks.
- Since XNNPACK has been enabled, it performs much better than NNPACK on the depthwise-separable convolutions used by most mobile computer vision networks.

### Test Plan

- CI Checks

Test Plan: Imported from OSS

Differential Revision: D22087365

Pulled By: xta0

fbshipit-source-id: 89a959b0736c1f8703eff10723a8fbd02357fd4a
2020-06-17 01:39:56 -07:00
9d588f7ce2 Removes dunder div (#39151)
Summary:
BC-breaking note:

If a user is using one of these dunders directly, they will no longer be available. Users should update to Python3-compatible dunders.

Original PR note:

`__div__` (and `__idiv__` and `__rdiv__`) are no longer special dunders in Python3. This PR replaces them with the `__truediv__` (`__itruediv__`, `__rtruediv__`) dunders, since we no longer support Python2.
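
A minimal before/after sketch for user code:

```
import torch

t = torch.ones(3)
# t.__div__(2) is no longer available; use the Python3 dunder or the operator:
print(t.__truediv__(2))  # tensor([0.5000, 0.5000, 0.5000])
print(t / 2)             # identical
```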
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39151

Differential Revision: D22075713

Pulled By: mruberry

fbshipit-source-id: d318b47b51f7cc4c3728b1606a34d81e49ba0fa1
2020-06-16 23:02:20 -07:00
00505adbad Add net_pos Tiles added during in-batch broadcast (#40078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40078

As titled. It's good to have net_pos for all the ops so that we can distinguish each op in the minimizer in net_runner.

Test Plan: unittest

Reviewed By: ipiszy, ChunliF

Differential Revision: D22062748

fbshipit-source-id: 5266abdb6dde63055fdffdba6e8d65bd0f221d7b
2020-06-16 21:51:18 -07:00
e7a3a43d8f [pytorch] upload android build size to scuba (#40010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40010

Use `upload_binary_size_to_scuba.py` to extract android library size and
upload to scuba.

Sample data: https://fburl.com/scuba/pytorch_binary_size/fz882auc
```
+-----------------------------------------------------------+-------------------+-----------+---------+-------------------------------------------------+------------------------------------------+----------+
|                        Build Name                         |      Branch       | Build Num |   Os    |                    Pkg Type                     |                   Sha1                   |   Size   |
+-----------------------------------------------------------+-------------------+-----------+---------+-------------------------------------------------+------------------------------------------+----------+
| linux_libtorch_3.7m_cpu                                   | gh/ljk53/149/head |   5842365 | linux   | libtorch                                        | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 111 MiB  |
| android_prebuild/aar/pytorch_android-release.aar__        | gh/ljk53/149/head |   5842981 | android | prebuild/aar/pytorch_android-release.aar        | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 29.1 MiB |
| android_prebuild-single/aar/pytorch_android-release.aar__ | gh/ljk53/149/head |   5842974 | android | prebuild-single/aar/pytorch_android-release.aar | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 7.96 MiB |
| android_prebuild/x86_64/libpytorch_jni.so__               | gh/ljk53/149/head |   5842981 | android | prebuild/x86_64/libpytorch_jni.so               | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 7.65 MiB |
| android_prebuild-single/x86/libpytorch_jni.so__           | gh/ljk53/149/head |   5842974 | android | prebuild-single/x86/libpytorch_jni.so           | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 7.62 MiB |
| android_prebuild/x86/libpytorch_jni.so__                  | gh/ljk53/149/head |   5842981 | android | prebuild/x86/libpytorch_jni.so                  | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 7.62 MiB |
| android_prebuild/arm64-v8a/libpytorch_jni.so__            | gh/ljk53/149/head |   5842981 | android | prebuild/arm64-v8a/libpytorch_jni.so            | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 6.44 MiB |
| android_prebuild/armeabi-v7a/libpytorch_jni.so__          | gh/ljk53/149/head |   5842981 | android | prebuild/armeabi-v7a/libpytorch_jni.so          | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 6.23 MiB |
| android_prebuild/x86_64/libfbjni.so__                     | gh/ljk53/149/head |   5842981 | android | prebuild/x86_64/libfbjni.so                     | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 319 KiB  |
| android_prebuild-single/x86/libfbjni.so__                 | gh/ljk53/149/head |   5842974 | android | prebuild-single/x86/libfbjni.so                 | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 319 KiB  |
| android_prebuild/x86/libfbjni.so__                        | gh/ljk53/149/head |   5842981 | android | prebuild/x86/libfbjni.so                        | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 319 KiB  |
| android_prebuild/arm64-v8a/libfbjni.so__                  | gh/ljk53/149/head |   5842981 | android | prebuild/arm64-v8a/libfbjni.so                  | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 282 KiB  |
| android_prebuild/armeabi-v7a/libfbjni.so__                | gh/ljk53/149/head |   5842981 | android | prebuild/armeabi-v7a/libfbjni.so                | f6821a3708bb2e88d9482f5ff60ecc6bcc60d85c | 214 KiB  |
+-----------------------------------------------------------+-------------------+-----------+---------+-------------------------------------------------+------------------------------------------+----------+
```

Test Plan: Imported from OSS

Differential Revision: D22040439

Pulled By: ljk53

fbshipit-source-id: 39116c768067edf25391428e36e5c403ad0715a5
2020-06-16 21:31:26 -07:00
3258cb61b1 Dynamic quantization support for LSTMCell, RNNCell and GRUCell [Remove randomness in weights] (#40102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40102

Enable dynamic quantization for LSTMCell, RNNCell and GRUCell
ghstack-source-id: 105997236

(Note: this ignores all push blocking failures!)

Test Plan: buck test caffe2/test:quantization -- 'test_quantized_rnn_cell \(quantization\.test_quantize\.TestPostTrainingDynamic\)'

Differential Revision: D22071017

fbshipit-source-id: 3fe1eac39db9c1e0566838eb8b969bbb1fa983c9
2020-06-16 21:29:50 -07:00
03529ed14d Remove hacky double registration of to_here op in reg_distributed_ops (#39602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39602

This was added as a part of
https://github.com/pytorch/pytorch/pull/38590 but we can use default arguments
here. We use fmt:;format to bind the default value to the rpc timeout at
runtime.
ghstack-source-id: 105983645

Test Plan: Ci

Differential Revision: D21912719

fbshipit-source-id: 7525c1322a95126f529301be142248af48565b82
2020-06-16 20:19:39 -07:00
15823ac6d5 Enhance DDP docstrings for DDP + RPC support. (#39916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39916

ghstack-source-id: 105999275

Test Plan: waitforbuildbot

Differential Revision: D22013190

fbshipit-source-id: be3bb12b2281579610581b809c822ab6b027fa71
2020-06-16 20:05:13 -07:00
23739654cd Resubmit Remove THTensor_(fill) & THTensor_(zero) (#40108)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40108

Remove `THTensor_(fill)` & `THTensor_(zero)` following the PR https://github.com/pytorch/pytorch/pull/39042 as reference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39727

Test Plan: buck test caffe2/caffe2/fb/python:pytorch_func_test

Reviewed By: dzhulgakov

Differential Revision: D22070199

Pulled By: ngimel

fbshipit-source-id: d32ff0cc0dbc8a80b49ce184f08bda34ad0f2668
2020-06-16 19:07:37 -07:00
bf544c4a7b [android][fbjni] Test_app and Readme update with the recent fbjni dep state (#40058)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40058

Test Plan: Imported from OSS

Differential Revision: D22054574

Pulled By: IvanKobzarev

fbshipit-source-id: 751e5bd5103aa869702356fc181f458fe4fcfc83
2020-06-16 18:42:56 -07:00
f42c948df5 [quant][graphmode] Support another use pattern of mean (#40038)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40038

Test Plan: Imported from OSS

Differential Revision: D22055696

fbshipit-source-id: 776196ce3d743deb8335d237bf5ef0fa67f7f26d
2020-06-16 18:37:21 -07:00
dcec099d48 Wrap Caffe2 (RowWise)SparseAdagrad fusion operator as a PT op (#39904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39904

This diff wraps Caffe2's (RowWise)SparseAdagrad fusion operator on GPU as a PT op.

Reviewed By: jianyuh

Differential Revision: D22010193

fbshipit-source-id: 5df3c506c0dadd3b21180829fd2d5084ac76abc3
2020-06-16 17:40:05 -07:00
15758bca55 Refactor LSTM tests, [Remove randomness in weights] (#40101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40101

Create three tests for LSTMs:
1. test_qlstm: checks the numerics of the quantized LSTM operator.
2. test_lstm_api: checks the LSTM module and compares it with the quantized LSTM op.
3. test_quantized_rnn: checks the dynamic quantization workflow, scriptability, and serialization of the quantized LSTM (a minimal workflow sketch follows below).
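
A minimal sketch of the dynamic quantization workflow exercised in (3), with illustrative sizes:

```
import torch

model = torch.nn.LSTM(input_size=4, hidden_size=4)
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.LSTM}, dtype=torch.qint8)
out, _ = qmodel(torch.randn(2, 1, 4))  # same interface as the float module
```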
ghstack-source-id: 105997268

(Note: this ignores all push blocking failures!)

Test Plan:
buck test caffe2/test:quantization -- 'test_lstm_api \(quantization\.test_quantized_module\.TestDynamicQuantizedModule\)' --print-passing-details

buck test caffe2/test:quantization -- 'test_quantized_rnn \(quantization\.test_quantize\.TestPostTrainingDynamic\)'

buck test caffe2/test:quantization -- 'test_qlstm \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --print-passing-details

Differential Revision: D22070826

fbshipit-source-id: 46c333e19b9eab8fa5cab6f132e89b80a635791a
2020-06-16 17:24:07 -07:00
3d8de74e17 Bumps readable file format version for torch.full inferring float from int values (#40089)
Summary:
Reserves file format version 5 for marking when torch.full(int)->FloatTensor will be deprecated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40089

Differential Revision: D22066359

Pulled By: mruberry

fbshipit-source-id: 6158e03ca75e3795a2641123ff23d67975170f44
2020-06-16 15:09:40 -07:00
b5d54db6f4 Revert D22071278: [quant][graphmode] Refactor dynamic quant tests
Test Plan: revert-hammer

Differential Revision:
D22071278

Original commit changeset: 54292addcfbc

fbshipit-source-id: 20ffbea0fd05e974b31381437c61040b5b24c993
2020-06-16 15:01:05 -07:00
cb1a1942ee Revert D22071277: [quant][graphmode] Test JIT tracing for dynamic quant cases
Test Plan: revert-hammer

Differential Revision:
D22071277

Original commit changeset: e8aa8637e636

fbshipit-source-id: e89c3e03a7d695e1d4f5ff8d8c5172633db83984
2020-06-16 14:59:09 -07:00
64689c2474 Remove unnecessary copy within blob serialization (#40096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40096

Declaring `tensor_proto` to be of type `auto` means that it will copy the entire `TensorProto` instead of just keeping a reference. This changes it to just use a const reference instead.

Test Plan:
Using the model loader benchmark to measure model loading performance:

### `tensor_proto` is of type `const auto&`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative  time/iter  iters/s
============================================================================
BlobProtoInt32DeserializationFloat16                        11.08ms    90.27
BlobProtoByteDeserializationFloat16             1509.73%   733.73us    1.36K
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8                          10.48ms    95.45
BlobProtoByteDeserializationUInt8               2974.57%   352.22us    2.84K
============================================================================
```

### `tensor_proto` is of type `auto`
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative  time/iter  iters/s
============================================================================
BlobProtoInt32DeserializationFloat16                        13.84ms    72.26
BlobProtoByteDeserializationFloat16              658.85%     2.10ms   476.08
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8                          17.09ms    58.51
BlobProtoByteDeserializationUInt8               3365.98%   507.80us    1.97K
============================================================================
```

Reviewed By: marksantaniello

Differential Revision: D21959644

fbshipit-source-id: 6bc2dfbde306f88bf7cd4f9b14b95ac69c2e1b4d
2020-06-16 14:45:59 -07:00
ba98c0e38c Split TensorIteratorConfig out of TensorIterator (#39803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39803

The basic concept is to make it more clear what the construction
side API is, as opposed to the "I want to actually do kernel stuff
with TensorIterator" API (which has been kept on TensorIterator.)
In fact, most of the stuff in TensorIteratorConfig isn't used by
TensorIterator later, so it can be dropped entirely after construction.

Before:

```
TensorIterator iter;
iter.config1();
iter.config2();
iter.config3();
iter.build();
// use iter
```

Now:

```
TensorIterator iter = TensorIteratorConfig()
  .config1()
  .config2()
  .config3()
  .build();
// use iter
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22018845

Pulled By: ezyang

fbshipit-source-id: 5baca9a4dc87149d71a44489da56d299f9b12b34
2020-06-16 14:33:18 -07:00
54c0ee1ebc LayerNorm use Fused Multiply and Add (#40012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40012

Use Fused Multiply and Add

Test Plan: Tested using the test_layernorm_nnpi_fp16.py test case.

Reviewed By: hyuen

Differential Revision: D22039340

fbshipit-source-id: d979daac152f885318ddcbbb9d7108219d4743e9
2020-06-16 14:27:00 -07:00
da8cd8260b Fix KeypointRCNN test (#39589)
Summary:
Since Argmax is updated in ONNX Runtime we can enable testing for all output, including keypoints_scores.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39589

Reviewed By: hl475

Differential Revision: D21992264

Pulled By: houseroad

fbshipit-source-id: a390b4628d2ac290902b9e651c69d47db9be540f
2020-06-16 13:45:23 -07:00
f69b72c738 Back out "Revert D21986243: TORCH_FN" (#40110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40110

Original commit changeset: 72c690c2b4c2
ghstack-source-id: 105993222

Test Plan: waitforsandcastle

Differential Revision: D22072829

fbshipit-source-id: 0bc1a3e389e2afb05688c472793d34eaddb67f2a
2020-06-16 13:38:29 -07:00
41fa4bef2a [quant] Support general op modules with inplace options (#39919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39919

e.g. torch.nn.ReLU6(inplace=True). It looks like this is already supported, but somehow it is not working in the tutorial.

Test Plan: Imported from OSS

Differential Revision: D22055695

fbshipit-source-id: 78a55b963cd3fac06f952f83c7c61c717cc839cc
2020-06-16 13:19:14 -07:00
fa4244d783 [quant][graphmode] Test JIT tracing for dynamic quant cases (#40040)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40040

Test Plan:
python test/test_quantization.py TestQuantizeDynamicScriptJitPasses

Imported from OSS

Differential Revision: D22071277

fbshipit-source-id: e8aa8637e6364092b6ff1c3a48dfc4551eb645ec
2020-06-16 13:16:42 -07:00
ddeaa74382 [quant][graphmode] Refactor dynamic quant tests (#40039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40039

Similar to static quant, break it up into op level tests and tests for jit passes

Test Plan:
python test/test_quantization.py TestQuantizeScriptPTDQOps
python test/test_quantization.py TestDynamicQuantizeScriptJitPasses

Imported from OSS

Differential Revision: D22071278

fbshipit-source-id: 54292addcfbc00f7af960fb333921db2ff9fda04
2020-06-16 13:14:48 -07:00
461aa8a1e2 [quant][graphmode] Support quantizing repeat (#39925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39925

Test Plan: Imported from OSS

Differential Revision: D22014883

fbshipit-source-id: 3076e64948f7ebdd99355f32185e2c716b113b43
2020-06-16 12:29:11 -07:00
ee365c58e1 Fix destructor ordering for cuda handle pools (#39345)
Summary:
Possible fix for gh-38385. Unfortunately, I haven't been able to reproduce the issue reliably, so can't say for certain.

Since this appears to be a destruction ordering issue, I've focused on making the destructor calls well-ordered:
- Each pool is now a function-local `static` instead of a global variable. This ensures the destructor happens before any relevant pytorch global state is destroyed.
- Each pool window now only stores a `std::weak_ptr` to the global pool. This means it can't extend the lifetime of the pool outside of the normal destructor ordering. That does also mean that if the `weak_ptr` is invalid, the handles will get leaked. However, that shouldn't happen under normal use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39345

Differential Revision: D22044376

Pulled By: ezyang

fbshipit-source-id: da1713b42c143ed1452a6edf1ecb05cd45743c7a
2020-06-16 12:23:27 -07:00
145df306ae Avoid using default process group in ProcessGroupAgent. (#39909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39909

As described in https://github.com/pytorch/pytorch/issues/33583,
ProcessGroupAgent initializes the default process group and this causes issues
if the user initializes the default process group themselves. Either the RPC
initialization would fail or the user's process group initialization would
fail.

To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.
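
A rough sketch of the usage this unblocks (hedged: rendezvous details are elided and the code below is illustrative, not from this PR):

```
import torch.distributed as dist
import torch.distributed.rpc as rpc

# Assumes MASTER_ADDR/MASTER_PORT env vars are set for rendezvous.
# The user owns the default process group...
dist.init_process_group("gloo", rank=0, world_size=2)
# ...and RPC init no longer clashes with it, since ProcessGroupAgent now
# creates its own internal ProcessGroupGloo instead of reusing the default.
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()
```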

Closes: https://github.com/pytorch/pytorch/issues/33583
ghstack-source-id: 105953303

Test Plan: waitforbuildbot

Differential Revision: D22011868

fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
2020-06-16 12:00:29 -07:00
7021635d61 fix more duplicated names (#40062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40062

fix duplicated op names after D21992552

Test Plan: build

Reviewed By: iseeyuan

Differential Revision: D22056588

fbshipit-source-id: 6d2fcf16b5b86b30b6ac7a4107b20c8cfb6816b0
2020-06-16 11:47:05 -07:00
7f270233fb Upgrade DNNL to 1.5 (#40088)
Summary:
- Bump DNNL to 1.5
- Bug fixes and improvements in ideep
  - suppress g++ Wreorder warning
  - avoid rebuilding `libmkldnn.so` https://github.com/oneapi-src/oneDNN/issues/743
  - enable conv3d (integration code was checked in by Xiaobing https://github.com/pytorch/pytorch/pull/35662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40088

Differential Revision: D22071530

Pulled By: albanD

fbshipit-source-id: e7a53d7421e8a7a03e36a7dfb68edc565a2f00df
2020-06-16 11:42:30 -07:00
ec1833bc3c Revert D22069566: Revert D22013026: [quant][graphmode] Pass debug option into insert_quant_dequant pass
Test Plan: revert-hammer

Differential Revision:
D22069566

Original commit changeset: 6230bc806089

fbshipit-source-id: 930490ab0b6a017c949445620e7c6b7056693998
2020-06-16 11:37:33 -07:00
49732f0450 Remove global CMAKE_INSTALL_RPATH_USE_LINK_PATH directive (#37737)
Summary:
Closes gh-35418,

PR gh-16414 added [the `CMAKE_INSTALL_RPATH_USE_LINK_PATH`directive](https://github.com/pytorch/pytorch/pull/16414/files#diff-dcf5891602b4162c36c2125c806639c5R16) which is non-standard and will cause CMake to write an `RPATH` entry for libraries outside the current build. Removing it leaves an RPATH entry for `$ORIGIN` but removes the entries for things like `/usr/local/cuda-10.2/lib64/stubs:/usr/local/cuda-10.2/lib64` for `libcaffe2_nvrtc.so` on linux.

The added test fails before this PR, passes after. It is equivalent to checking `objdump -p torch/lib/libcaffe2_nvrtc.so | grep RPATH` for an external path to the directory where cuda "lives"

I am not sure if it solves the `rpath/libc++.1.dylib` problem for `_C.cpython-37m-darwin.so` on macOS in issue gh-36941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37737

Differential Revision: D22068657

Pulled By: ezyang

fbshipit-source-id: b04c529572a94363855f1e4dd3e93c9db3c85657
2020-06-16 11:18:39 -07:00
d57ca73c53 Remove item and data_ptr for std::complex (#39838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39838

Differential Revision: D22068251

Pulled By: ezyang

fbshipit-source-id: d1f0e1ff98290a139f1a080a9f7a1258943cd3ad
2020-06-16 11:13:54 -07:00
181ea1acce [quant][graphmode] Support squeeze/unsqueeze (#39924)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39924

Test Plan: Imported from OSS

Differential Revision: D22014713

fbshipit-source-id: 76d6d8509062ff9203979b0d2f0cfb01778b3c2f
2020-06-16 11:03:32 -07:00
f1a5f66115 [xplat] Add Windows specific ATen build definitions (#40092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40092

Move the <windows.h> include in THAllocator to after the header that might include glog

Test Plan: buck build xplat/mode/arstudio/windows //xplat/caffe2:aten_cpuWindows

Reviewed By: nlutsenko

Differential Revision: D22061135

fbshipit-source-id: 10f51955c0092761a96bc6169236c6e07b412313
2020-06-16 10:57:02 -07:00
b3dd4d9c33 [JIT] remove callable check to compile objects with __call__ (#40041)
Summary:
Fix for https://github.com/pytorch/vision/issues/2320 - still need to fix whatever reverting this change breaks

EDIT: reverting this change doesn't seem to break anything, and fixes the torchvision issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40041

Reviewed By: eellison

Differential Revision: D22067586

Pulled By: fmassa

fbshipit-source-id: 4b235fd3a69665dcc5689f12310097be31b40a28
2020-06-16 10:52:38 -07:00
f1e575a0bf Revert D20496044: [pytorch][PR] Change AccumulateGrad to yield .grads that match weights' memory layout
Test Plan: revert-hammer

Differential Revision:
D20496044

Original commit changeset: 248d680f4b1b

fbshipit-source-id: 6462b25e3fb9c8596c1da443389089f09c32df4d
2020-06-16 10:38:40 -07:00
4b5530de72 optimize upsample performance linear mode on CPU (#34864)
Summary:
This PR aims at improving `nn.Upsample()` performance on CPU with modes `linear`, `bilinear`, `trilinear`.
For single socket inference, up to **31x** performance improvement.
For single core inference, up to **1.8x** performance improvement.
For dual socket training, up to **28x** performance improvement.

A `channels last` format kernel is also provided.
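
A minimal usage sketch of the affected paths (shapes are illustrative only):

```
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
x = torch.randn(8, 3, 64, 64)
y = up(x)  # contiguous path
# channels-last input exercises the new channels-last kernel
y_cl = up(x.contiguous(memory_format=torch.channels_last))
```
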
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34864

Differential Revision: D20772990

Pulled By: ngimel

fbshipit-source-id: a48307f2072227f20e742ebbd4a093bb29537d19
2020-06-16 10:36:58 -07:00
5843854e66 [TensorPipe] Fix transport/channel priorities (#40090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40090

I messed up in #39957: TensorPipe used to have a bug where it inverted priorities and preferred lower ones over higher ones. I had fixed that bug at the same time as I was writing that PR but then forgot to update the priority values once that PR landed. This meant that TensorPipe was trying to bootstrap using SHM and then upgrade to UV. That worked in our tests because they are all run on the same machine, but it broke using TensorPipe across different machines. I'll take suggestions on how to put tests in place to prevent this type of breakage from happening.

The silver lining is that for some time our tests were testing the UV transport, instead of the SHM one, and it seems to be working alright. ;)
ghstack-source-id: 105967203

Differential Revision: D22067264

fbshipit-source-id: c6e3ae7a86038714cfba754b0811ca8a9a6f1347
2020-06-16 10:28:42 -07:00
dd581b4512 DOC: fix rpc reference in top-level index (#40077)
Summary:
Fixes gh-40046

PR gh-37419 refactored the content of `docs/source/rpc/index.rst` into `docs/source/rpc.rst` but did not link to the latter from `doc/source/index.rst` so the top-level RPC documentation is missing from https://pytorch.org/docs/master/.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40077

Differential Revision: D22068128

Pulled By: mrshenli

fbshipit-source-id: 394433f98f86509e0c9cb6d910a86fb8a2932683
2020-06-16 10:26:03 -07:00
56b4b44107 Batching rule for torch.mul (#39859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39859

This PR implements the batching rule for `torch.mul` for the (Tensor,
Tensor) overload.

NB: ~250 lines of this PR are tests, so please don't be scared away by
the line count.

It introduces the BroadcastingVmapTransform, which is the VmapTransform
one should use for operations that broadcast their inputs (see the toy sketch
after this list). This transform:
- permutes all batch dimensions to the front of the tensors
- aligns the batch dimensions of the tensors, adding extra 1's where
necessary
- aligns the non-batch dims of the tensors, adding extra 1's where
necessary.
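
A toy Python illustration of the alignment steps above (this is not the actual C++ VmapTransform; shapes are made up):

```
import torch

a = torch.randn(5, 2, 3)  # batch dim B=5 already at the front
b = torch.randn(3)        # no batch dim
b = b.reshape(1, 1, 3)    # pad batch and non-batch dims with extra 1's
out = a * b               # broadcasting performs one mul per batch element
assert out.shape == (5, 2, 3)
```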

Test Plan:
- Test BroadcastingVmapTransform in `./build/bin/vmap_test`.
- Test mul_batching_rule in `./build/bin/vmap_test`.

Differential Revision: D22067337

Pulled By: zou3519

fbshipit-source-id: 5862da8c2b28699b08c7884342a1621581cb2e7f
2020-06-16 10:25:59 -07:00
33b82c7271 [JIT] Add registry for backend lowering functions (#39552)
Summary:
**Summary**
This commit adds a registry for storing lowering functions for backends.
Instead of backends registering these lowering functions in separate C
extension modules, these will be registered in the Torch extension.
Backends are registered statically, so a registry is needed to hold
these lowering functions until Python bindings are created.

**Test Plan**
`python test/test_jit.py TestBackends`

```
Couldn't download test skip set, leaving all tests enabled...
..
----------------------------------------------------------------------
Ran 2 tests in 0.104s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39552

Reviewed By: mortzur

Differential Revision: D22033855

Pulled By: SplitInfinity

fbshipit-source-id: 05abf152274e5e51c37b6004886ea25bd4d33b80
2020-06-16 10:23:14 -07:00
ad86c94f14 Reduce memory requirement for test_argminmax_large_axis (#40036)
Summary:
Closes gh-39060

The `TensorIterator` splitting is based on `can_use_32bit_indexing` which assumes 32-bit signed ints, so we can get away with just 2**31 as the axis length. Also tested on an old commit that I can reproduce the test failure on just a 1d tensor, overall quartering the memory requirement for the test.

4c7d81f847/aten/src/ATen/native/TensorIterator.cpp (L879)

For reference, the test was first added in gh-33310.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40036

Differential Revision: D22068690

Pulled By: ezyang

fbshipit-source-id: 83199fd31647d1ef106b08f471c0e9517d3516e3
2020-06-16 10:19:10 -07:00
5add2e861c Revert D21628596: Refactor LSTM tests
Test Plan: revert-hammer

Differential Revision:
D21628596

Original commit changeset: 4aeda899f2e5

fbshipit-source-id: ab6544b87404863e054172aa9ec7ada51fad8e5e
2020-06-16 10:14:15 -07:00
e55e0cb1a9 Revert D20978736: Dynamic quantization support for LSTMCell, RNNCell and GRUCell
Test Plan: revert-hammer

Differential Revision:
D20978736

Original commit changeset: 8f303ba1d7f8

fbshipit-source-id: bcd300819616d6536f582fcd3c90decd543c4657
2020-06-16 10:11:32 -07:00
5f6e55fb32 Clean up thrust::complex from tanh_backward (#39827)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39827

Differential Revision: D22067191

Pulled By: anjali411

fbshipit-source-id: a257f0e33a917c13a7d0b855a869e0b9aca61541
2020-06-16 10:09:39 -07:00
b372000d69 [quant][graphmode] Run RemoveRedundantDequantize in the end (#39923)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39923

Test Plan: Imported from OSS

Differential Revision: D22014712

fbshipit-source-id: a681bbab5d90ac8ebeff03cde5277501e86ee1d4
2020-06-16 09:55:03 -07:00
305921734a Revert D22013026: [quant][graphmode] Pass debug option into insert_quant_dequant pass
Test Plan: revert-hammer

Differential Revision:
D22013026

Original commit changeset: 714b938f25c1

fbshipit-source-id: 6230bc8060892e6485159ca88cc3ad49217791a2
2020-06-16 09:44:04 -07:00
12cf8390e6 Update aarch64 CI badge (#39914)
Summary:
This PR adds python37 and python38 badges for the aarch64 build CI.

You can preview the badge here: https://github.com/wangxiyuan/pytorch/tree/update_aarch64_ci

The build job is passing now since we use CLANG instead of GCC for building.

Using GCC still hits an error, which is mentioned in https://github.com/pytorch/pytorch/issues/33124

Related: https://github.com/pytorch/pytorch/issues/39558
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39914

Differential Revision: D22068834

Pulled By: ezyang

fbshipit-source-id: d8a2ec795408850ec6eba3af7b29ddfeb3cbea38
2020-06-16 09:22:42 -07:00
48db06e39a Dynamic quantization support for LSTMCell, RNNCell and GRUCell (#37159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37159

Enable dynamic quantization for LSTMCell, RNNCell and GRUCell
ghstack-source-id: 105946183

(Note: this ignores all push blocking failures!)

Test Plan: buck test caffe2/test:quantization -- 'test_quantized_rnn_cell \(quantization\.test_quantize\.TestPostTrainingDynamic\)'

Differential Revision: D20978736

fbshipit-source-id: 8f303ba1d7f8e0c646ac73e862d2c1e735b7ff61
2020-06-16 09:14:59 -07:00
2beb9690c3 Change AccumulateGrad to yield .grads that match weights' memory layout (#34904)
Summary:
Currently, whether `AccumulateGrad` [steals](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L42)) or [clones](67cb018462/torch/csrc/autograd/functions/accumulate_grad.h (L80)) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads.  This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout).  The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes.  I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix.  The spirit is layout matching in general:
- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates autograd docs to describe potential BC-breaking changes below.

## BC notes
ngimel albanD gchanan

#### BC-breaking
In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change.  Any user code that was accustomed to `view(-1)`ing these grads will break.

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed.  In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params.  Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous.  IMO this is a mild BC breakage.  Param backward hooks still see grads come in with whatever format the backward kernel gave them.  The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`.  Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches
At alband's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place.  Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout.  After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`.  This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.
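
A minimal sketch of that control point (illustrative, not from the PR; assumes a channels-last param):

```
import torch

param = torch.randn(8, 3, 32, 32).contiguous(memory_format=torch.channels_last).requires_grad_()
# Preset .grad so in-place accumulation keeps the desired layout
param.grad = torch.zeros_like(param, memory_format=torch.channels_last)
param.pow(2).sum().backward()
assert param.grad.is_contiguous(memory_format=torch.channels_last)
```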

One limitation (present before this PR and unchanged by this PR):  Presetting `param.grad` does not ensure in-place accumulation all the time.  For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.

----------------------------
I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:
1. make sure Reducer's ops sync with AccumulateGrad streams
2. ~to reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~  PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately.  Without cat+div fusion, div-while-copying is the best we can do.
3. https://github.com/pytorch/pytorch/issues/38942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
2020-06-16 08:43:31 -07:00
c065049592 Add smoke test to Windows CI (#39941)
Summary:
Related PR: https://github.com/pytorch/builder/pull/456
Test at: https://github.com/pytorch/pytorch/pull/39943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39941

Differential Revision: D22068004

Pulled By: ezyang

fbshipit-source-id: 5244f64d842b3a8bf0af720dffb4b1a0370cc178
2020-06-16 08:29:02 -07:00
d71804a57d Eliminate TensorIterator::add_output with explicit dtype. (#39800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39800

I'm working on a refactor where I want to represent the inputs
and outputs to TensorIterator as just plain Tensors, which means
I need to kill this add_output with explicit dtype.  This exists
solely to set what the output dtype should be.  We have a pretty
similar API for doing this for shapes (declare_static_shape) so
I just copied this API for dtypes instead.

Although the new version is more code, I think the intent is more
explicit.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21981740

Pulled By: ezyang

fbshipit-source-id: cf45a6dbab6fb979ca3b412c31eca3dd4f4067de
2020-06-16 08:24:27 -07:00
23db54acdf [DataLoader] add repr for WorkerInfo (#39975)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39975

Differential Revision: D22039414

Pulled By: ezyang

fbshipit-source-id: 230f68a91fca901bce652fdf88ba88167f39b978
2020-06-16 08:19:32 -07:00
ee5ad6ce25 [quant][graphmode] Pass debug option into insert_quant_dequant pass (#39915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39915

Some usages, e.g. add_scalar, will not support the debug option;
that is, we will not have a numerically exact representation of the final quantized model
before finalize if people use add_scalar.
A warning will be added in a later PR.

Test Plan: Imported from OSS

Differential Revision: D22013026

fbshipit-source-id: 714b938f25c10fad3dfc79f095356b9803ef4b47
2020-06-16 08:14:50 -07:00
ebd869153c Clarifies compare_with_numpy behavior (#40064)
Summary:
Currently compare_with_numpy requires a device and dtype, but these arguments are ignored if a tensor is provided. This PR updates the function to only take device and dtype if a tensor-like object is given. This should prevent confusion that you could, for example, pass a CPU float tensor but provide a CUDA device and integer dtype.

Several tests are updated to reflect this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40064

Differential Revision: D22058072

Pulled By: mruberry

fbshipit-source-id: b494bb759855977ce45b79ed3ffb0319a21c324c
2020-06-16 05:01:33 -07:00
8939849f72 Revert D21986243: TORCH_FN
Test Plan: revert-hammer

Differential Revision:
D21986243

Original commit changeset: a123571c18aa

fbshipit-source-id: 72c690c2b4c2fc39e1c9192d1c410f49bb4077a5
2020-06-16 04:43:46 -07:00
12cb80b5b8 TORCH_FN (#39823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39823

Add a compile time function pointer that can be used to pass function pointers in template args.
This is very useful for metaprogramming function wrappers.
ghstack-source-id: 105944072

Test Plan: waitforsandcastle

Differential Revision: D21986243

fbshipit-source-id: a123571c18aa0e65908cbb131f28922ceb59061c
2020-06-16 03:08:08 -07:00
144e8dc5a3 [quant][graphmode] Use quantizedbatch_norm in graph mode (#39911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39911

Test Plan: Imported from OSS

Differential Revision: D22012282

fbshipit-source-id: 98af55172cbeaa7080865d6533df21647a7cedfa
2020-06-16 00:58:11 -07:00
655f1ea176 Refactor LSTM tests (#38851)
Summary:
Create three tests for LSTMs:
1. test_qlstm: Test to check numerics of quantized LSTM operator.
2. test_lstm_api: To check the LSTM module and compare
it with the quantized LSTM op
3. test_quantized_rnn: Check the dynamic quantization workflow, scriptability and serialization of quantized
LSTM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38851

ghstack-source-id: 105945574

(Note: this ignores all push blocking failures!)

Test Plan:
buck test caffe2/test:quantization -- 'test_lstm_api \(quantization\.test_quantized_module\.TestDynamicQuantizedModule\)' --print-passing-details

buck test caffe2/test:quantization -- 'test_quantized_rnn \(quantization\.test_quantize\.TestPostTrainingDynamic\)'

buck test caffe2/test:quantization -- 'test_qlstm \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --print-passing-details

Differential Revision: D21628596

fbshipit-source-id: 4aeda899f2e5f14bfbe3d82096cb4ce89c725fa1
2020-06-16 00:41:24 -07:00
bcb44796ba [pytorch] consolidate android gradle build scripts (#39999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39999

Cleaned up the android build scripts. Consolidated common functions into
common.sh. Also made a few minor fixes:

- We should trust build_android.sh to do the right thing about reusing an
  existing `build_android_$abi` directory;

- We should clean up `pytorch_android/src/main/jniLibs/` to remove
  broken symbolic links in case the custom abi list changed since the last build;

Test Plan: Imported from OSS

Differential Revision: D22036926

Pulled By: ljk53

fbshipit-source-id: e93915ee4f195111b6171cdabc667fa0135d5195
2020-06-15 23:55:21 -07:00
9204d76b5f Back out "[pytorch][PR] Remove THTensor_(fill) & THTensor_(zero)"
Summary: Original commit changeset: bfeeaebe93d9

Test Plan: CI runs

Differential Revision: D22062523

fbshipit-source-id: 6d827fd682a9e64c49876cd1c7269d145e93dc2c
2020-06-15 23:49:58 -07:00
f0b40cac30 [pytorch] simplify android circleci definition data model
Summary:
Android jobs don't seem to fit the `pytorch_build_data.py` data model
very well. Other mobile jobs all have their own data model files - even
for Android nightly jobs. As we are adding more variants like vulkan,
it's going to be hard to maintain.

So renamed `android_gradle.py` to `android_definitions.py` and moved
android jobs into it, following the conventions of `nightly_android.py`
and `ios_definitions.py`.

Differential Revision: D22036915

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Pulled By: ljk53

fbshipit-source-id: 42ad5cbe451edecef17f6d3cbf85076cc3acf615
2020-06-15 23:33:27 -07:00
18fe9d267c Revert D22050656: [Yet Another Reland] Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h
Test Plan: revert-hammer

Differential Revision:
D22050656

Original commit changeset: 274bc0733c37

fbshipit-source-id: ed9af6305dab96bb79a2016ce82f80af4bd1e5b7
2020-06-15 22:26:07 -07:00
1a388da10a [quant] add quantized::batch_norm (#39910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39910

We need this for graph mode quantization: since we only have `aten::batch_norm` and the dimension
is only known at runtime, we'll need to quantize it to `quantized::batch_norm`

Test Plan: Imported from OSS

Differential Revision: D22012281

fbshipit-source-id: 2973d86a17a02b7bdc36bd1e703e91584d9139d0
2020-06-15 21:32:58 -07:00
5d4a662846 DNNL: fix F.max_pool2d and F.avg_pool2d issue when stride=None (#39221)
Summary:
For `F.max_pool2d` and `F.avg_pool2d`, there was a **RuntimeError** when stride is **None**; this PR solves it.
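
A quick repro sketch of the fixed call pattern (assumes an MKL-DNN-enabled build):

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8).to_mkldnn()
# stride=None now falls back to kernel_size instead of raising
y = F.max_pool2d(x, kernel_size=2, stride=None)
z = F.avg_pool2d(x, kernel_size=2, stride=None)
```
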
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39221

Differential Revision: D22059565

Pulled By: ngimel

fbshipit-source-id: 2080e1e010815aedd904c58552e92be9f7443d38
2020-06-15 21:00:12 -07:00
399dd84c8c Fix TensorPipeAgent shutdown to ensure it drains all outstanding work. (#40060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40060

As part of debugging https://github.com/pytorch/pytorch/issues/39855,
I noticed that TensorPipeAgent's ThreadPool was still executing tasks when the
python interpreter was shutting down. This caused issues with
pybind::gil_scoped_acquire() since it can't be called when the interpreter is
shutting down, resulting in a crash.

The reason for this was that TensorPipeAgent was calling waitWorkComplete and
then shutting down the listeners. This meant that after waitWorkComplete
returned, there could still be a race where an RPC call gets enqueued before we
shut down the listeners.

To avoid this situation, I've moved the call to waitWorkComplete at the end of
shutdown (similar to ProcessGroupAgent).

Closes: https://github.com/pytorch/pytorch/issues/39855
ghstack-source-id: 105926653

Test Plan:
1) Ran test_backward_node_failure
(__main__.TensorPipeAgentDistAutogradTestWithSpawn) 100 times to verify the
fix.
2) waitforbuildbot

Differential Revision: D22055708

fbshipit-source-id: 2cbe388e654b511d85ad416e696f3671bd369372
2020-06-15 20:38:25 -07:00
d4faf14cb2 [Yet Another Reland] Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h (#40045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40045

Fixes #39471

Reland of #39612 and #39881

Test Plan: Imported from OSS

Differential Revision: D22050656

Pulled By: pbelevich

fbshipit-source-id: 274bc0733c37ef6c87deb3344bb49ca9107e257b
2020-06-15 20:05:30 -07:00
f13be5fde1 Check if generator has next normal sample cache methods in normal_distribution (#39816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39816

This change replaces [`#if !defined(__CUDACC__) && !defined(__HIPCC__)`](856215509d/aten/src/ATen/core/DistributionsHelper.h (L147)) with SFINAE expression that checks if RNG typename has next_double_normal_sample, set_next_double_normal_sample, next_float_normal_sample, set_next_float_normal_sample methods

It is required by (and manually tested with) https://github.com/pytorch/csprng/pull/28

Fixes #39618

Test Plan: Imported from OSS

Differential Revision: D22002599

Pulled By: pbelevich

fbshipit-source-id: e33d42a7e88c5729b077b9cdbf1437158dab48bc
2020-06-15 19:57:04 -07:00
00651b8c93 [distribtued.nn] Implement TorchScript-compatible RemoteModule API (#37139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37139

See design doc in https://github.com/pytorch/pytorch/issues/37136

ghstack-source-id: 105926270

Test Plan:
TODO:

- Make the generated Interface usable. https://github.com/pytorch/pytorch/pull/37139#discussion_r434190978
-
- Avoid generating the same template instances for Module that is not scriptable.
- Remove "infer_module_interface_cls".
- Use Python format instead of a CodeTemplate
- Use Python tempfile to track and delete the file. Does it work if there is a crash?

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_scripted_remote_module_template

buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_non_scripted_remote_module_template
```

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_spawn
```

```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_user_provided_global_unique_name

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_forward_async_script

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_forward_sync_script

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_forward_with_kwargs

buck build mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork && \
buck-out/gen/caffe2/test/distributed/nn/api/remote_module_fork\#binary.par -r test_user_provided_global_unique_name
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

buck test mode/opt-asan //caffe2/test:jit -- 'test_script_forward_method_replacement'

buck build mode/dev-nosan //caffe2/test:jit && \
buck-out/gen/caffe2/test/jit\#binary.par -r 'test_script_forward_method_replacement'

buck build mode/dev-nosan //caffe2/test:jit && \
buck-out/gen/caffe2/test/jit\#binary.par -r 'test_imported_classes'

Differential Revision: D20499658

fbshipit-source-id: dd9383ae4eb2343366c11127664f845b91ca3b0a
2020-06-15 19:07:35 -07:00
f37b8e73f4 [quant][graphmode] Support prim:TupleUnpack and prim::TupleConstruct (#39895)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39895

Test Plan: Imported from OSS

Differential Revision: D22009854

fbshipit-source-id: a5dab2b4f943e5e047ba9e8573088adf66f5da6b
2020-06-15 18:55:15 -07:00
eb358f49c2 Overload complex math functions on both :: and std:: (#39829)
Summary:
Because ROCm has a bug with std:: functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39829

Differential Revision: D22018430

Pulled By: anjali411

fbshipit-source-id: 671e158d3e3342394d1deaebd7ff011cce94c31a
2020-06-15 16:53:16 -07:00
84d8a42fdb [android] Remove android fbjni subproject (#39691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39691

After switching on using fbjni-java-only dependency, we do not need to have gradle subproject fbjni.

Test Plan: Imported from OSS

Differential Revision: D22054575

Pulled By: IvanKobzarev

fbshipit-source-id: 331478a57dd0d0aa06a5ce96278b6c897cb0ac78
2020-06-15 15:58:18 -07:00
d602950cb4 [torch.distributed.rpc] Add WorkerInfo python repr magic method (#40004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40004

close https://github.com/pytorch/pytorch/issues/39965
ghstack-source-id: 105891281

Test Plan:
buck test mode/opt-asan //caffe2/test:jit -- 'test_vae_quantized \(jit\.test_models\.TestModels\)'

buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork

Differential Revision: D5696583

fbshipit-source-id: 19570414dc833c38fcd1ad38d2f0a816dbf51743
2020-06-15 15:08:29 -07:00
ecfe0c9a25 [TensorPipe] Use registry for transports and channels (#39957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39957

In order to provide a pluggable and extendable way to add new transports and channels to the TensorPipe agent, we use two registries. This allows us to separate the specific details of each backend (e.g., how it determines what address to use) from the generic logic of setting up TensorPipe.

Test Plan: Built `//caffe2:ifbpy` on two devservers, one in CLN and the other in PRN, and ran RPC across them.

Differential Revision: D22017614

fbshipit-source-id: 4ea7e6ed004a69187666f41bf59858e8174fde0d
2020-06-15 15:04:00 -07:00
51e341df4f [bernoulli_kernel] Replace CPU_tensor_apply functions with cpu_serial_kernel (#39711)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/39556
Related https://github.com/pytorch/pytorch/issues/38558

Replace CPU_tensor_apply functions with cpu_serial_kernel in bernoulli_kernel, unifying bernoulli_kernel with all other kernels in `cpu/DistributionTemplates.h`.

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39711

Differential Revision: D22052374

Pulled By: pbelevich

fbshipit-source-id: 416334da50195b67f05a18a98971f370cba4fb0d
2020-06-15 14:11:41 -07:00
0c25428597 [futures] Reland: Add torch.futures.collect_all()/wait_all() python api. (#39964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39964

The "[fut.wait() for fut in futs]" idiom can introduce up to
O(len(futs)) thread switches, which may be excessive for large N.

This plumbs through the new c++ c10::collectAll() to Python space
so that we only employ a single jit-side wait.
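
A small sketch of the Python API this exposes:

```
import torch

futs = [torch.futures.Future() for _ in range(3)]
for i, fut in enumerate(futs):
    fut.set_result(i * i)

# One wait for all futures instead of N individual fut.wait() calls
print(torch.futures.wait_all(futs))          # [0, 1, 4]
collected = torch.futures.collect_all(futs)  # Future wrapping a list of Futures
print([f.wait() for f in collected.wait()])  # [0, 1, 4]
```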

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn

Differential Revision: D22027412

fbshipit-source-id: 4e344a19a09638ee46e7fc478df80a41941b84ce
2020-06-15 14:07:12 -07:00
cc3fc786b7 [resubmit] [pytorch][PR] Fix for num_threads==1 in OpenMP "parallel for" (#39533)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39533

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D21889269

fbshipit-source-id: 5ba13a0a3ec11edd0b6a7c3fdb35396b847a3d9e
2020-06-15 13:14:59 -07:00
f6b0fbe2c5 topk tensor k support (#39407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39407

 - support passing a single-element tensor as k for the topk module
 - support passing a single-element tensor to constant fill output

Test Plan:
buck test dper3/dper3/modules/tests:core_modules_test -- test_topk_gating_without_split_examples_tensor_k
buck test caffe2/caffe2/python:hypothesis_test -- test_constant_fill_from_tensor

Reviewed By: huayuli00

Differential Revision: D21843739

fbshipit-source-id: 0c5f5c03e9f57eeba40c0068784625164c2527ec
2020-06-15 13:10:20 -07:00
4c3436838f Show which type was the wrong one when a signature is invalid (#39491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39491

-
ghstack-source-id: 105820787

Test Plan: waitforsandcastle

Differential Revision: D21872519

fbshipit-source-id: 18f030c2b4283d6e6833d9b5164e7484137ca0fb
2020-06-15 12:58:05 -07:00
79450edad3 [JIT] IRParser: properly parse negative numbers. (#39981)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39981

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22032786

Pulled By: ZolotukhinM

fbshipit-source-id: b6c5237ac5c1c331d5053a620eb9a37a4c698125
2020-06-15 12:28:41 -07:00
42f0ea49ca [Codemod][GleanFbcode] Remove dead includes in caffe2/binaries
Reviewed By: ilia-cher

Differential Revision: D21949969

fbshipit-source-id: 80336f82e9507dd001d079644cba5012bc5c8eed
2020-06-15 12:16:52 -07:00
569c85b45d [futures] Add assert to Future constValue() accessor, add hasValue(). (#39950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39950

Per the comment in the code, constValue() should only be used in
the case where the future was complete and value was not an error.
Add an assert to enforce this.

Also, add hasValue() accessor for completeness.
ghstack-source-id: 105815597

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit:

Differential Revision: D22021776

fbshipit-source-id: b59b6c775eab344068a76f4cd8c3a9dc1f2a174e
2020-06-15 12:11:22 -07:00
019eeb3183 Kill DataLoader worker when we can't join (#39869)
Summary:
There still are occasional reports of DataLoader workers not exiting (e.g., https://github.com/pytorch/pytorch/issues/39570). Before we figure out why, we should just kill them if the join times out, to prevent hanging.
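
An illustrative sketch of the join-then-kill pattern (hypothetical helper, not the actual DataLoader internals):

```
import multiprocessing as mp

def shutdown_worker(w: mp.Process, timeout: float = 5.0) -> None:
    w.join(timeout)
    if w.is_alive():
        # The worker failed to exit in time; force-kill to avoid hanging.
        w.terminate()
        w.join()
```
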
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39869

Differential Revision: D22018501

Pulled By: ezyang

fbshipit-source-id: 66a00d0f5b3e303b6106b336949176b3ff8ac8ae
2020-06-15 11:18:23 -07:00
1d642e2adf Improve cuda error message for MSVC (#39987)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39987

Differential Revision: D22039408

Pulled By: ezyang

fbshipit-source-id: b15f6eced0aaee1087c77564126aa304623cbed1
2020-06-15 10:28:28 -07:00
ac8d63a52f Update jenkins caffe2 scripts for ROCm circleci images. (#39908)
Summary:
Remove work-around to install conda locally for older ROCm jenkins images.
Remove use of sudo to install pip packages.
Install missing packages that caffe2's test.sh needs on ROCm.

CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39908

Differential Revision: D22044404

Pulled By: ezyang

fbshipit-source-id: da6b5a45dcf68432339ad6f1c4af2d8a96df73f1
2020-06-15 09:06:28 -07:00
c6b69a4e4d Delete Python <= 3.5 specific checks from the code (#39879)
Summary:
- Remove PY3 and PY34 checks from `torch/testing/_internal/common_utils.py`
- Remove the PY35 global var from `torch.jit.annotations`
- Always call `try_get_real_signature` in `torch/jit/annotations.py`
- Use `map` instead of `imap`; since Python 2 is no longer supported, map is always lazy.
- Remove all pre-Python-3.6 checks from `torch/_six.py` and `torch/_appdirs.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39879

Differential Revision: D22037811

Pulled By: malfet

fbshipit-source-id: af0c79f976569c2059d39ecb49c6b8285161734f
2020-06-15 08:16:06 -07:00
c8c53c802e Add generator= kwarg for DataLoader & random samplers (#39737)
Summary:
Fix https://github.com/pytorch/pytorch/issues/39572

Add `generator=` kwarg for DataLoader & random samplers
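
A short usage sketch of the new kwarg:

```
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

ds = TensorDataset(torch.arange(10))
g = torch.Generator()
g.manual_seed(0)
# Shuffling is now reproducible without touching the global RNG
loader = DataLoader(ds, batch_size=2, shuffle=True, generator=g)
sampler = RandomSampler(ds, generator=g)
```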

cc: SsnL, deeppatel4557, albanD, mitar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39737

Differential Revision: D22019132

Pulled By: albanD

fbshipit-source-id: 835e08b86c5396bc0b0e41057661306b15394d6e
2020-06-15 07:01:20 -07:00
541814f2b7 Remove dead ScatterGather code (#39963)
Summary:
This code was probably left behind after an ATen port.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39963

Differential Revision: D22039370

Pulled By: ezyang

fbshipit-source-id: 4ef75bac9b69f4b508a0b09c5c1f2ebc21bd9546
2020-06-14 20:02:15 -07:00
cf64af1ad2 Revert D22036002: [pytorch][PR] Kill thrust::complex from log kernels
Test Plan: revert-hammer

Differential Revision:
D22036002

Original commit changeset: 8852a833a0c7

fbshipit-source-id: 36d3c8d0e489f8a11a6e3e9d1ae162c192748037
2020-06-14 15:30:48 -07:00
ede9bc97c3 Fix the processing logic of bernoulli on amd (#40001)
Summary:
- Fixed the bug discussed in https://github.com/pytorch/pytorch/issues/38558
- This PR aims to make the processing of bernoulli on AMD fall back to the default version, even though `AT_MKL_ENABLED` is set to `TRUE`.
- This logic used to be in the old code but was broken by the latest update; this PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40001

Differential Revision: D22037646

Pulled By: pbelevich

fbshipit-source-id: c0aa4ba37416d2568daf3463cfede6838ffaeac1
2020-06-14 13:46:13 -07:00
4947ee3811 Kill thrust::complex from log kernels (#39902)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39902

Differential Revision: D22036002

Pulled By: pbelevich

fbshipit-source-id: 8852a833a0c71343ae630754f00da35a66e05917
2020-06-14 11:44:28 -07:00
5b194b0fb2 Remove thrust::complex from reciprocal (#39899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39899

Differential Revision: D22035940

Pulled By: pbelevich

fbshipit-source-id: 90d760317c47d2bad5a15384495a7a6cfbb655b7
2020-06-14 10:13:26 -07:00
d5236f8517 Avoid initializing unnecessary tensors in nccl.reduce (#39688)
Summary:
While working on https://github.com/pytorch/pytorch/issues/38911, I realized that `nccl.reduce` only needs a single output tensor, while our current implementation requires a list of output tensors. This, along with a TODO I fixed in reduce_add, should provide some speedup for data parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39688

Differential Revision: D22034547

Pulled By: mrshenli

fbshipit-source-id: e74d54d673ebbb062474b1bb5cc93a095a3a5f6c
2020-06-14 10:11:32 -07:00
8072f0685f Add zero input support for batch permutation op (#39851)
Summary:
The batch permutation op currently does not support zero-size input; with this change it can output a tensor the same as the input if the first dimension is zero.

This solves facebookresearch/detectron2#1580
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39851

Reviewed By: houseroad

Differential Revision: D22033207

Pulled By: ppwwyyxx

fbshipit-source-id: 73b540d2182fe85ed9a47220237a8f213d68ae16
2020-06-13 21:34:24 -07:00
f1d10978a4 Added Mean and Variance calculation function. (#39986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39986

Mean and Variance computation to match with Intel NNPI implementation.

Test Plan: Manual Testing

Reviewed By: hyuen

Differential Revision: D22008566

fbshipit-source-id: 6ac4563859b84121a2482f8e2f738be5c6111f57
2020-06-13 18:41:51 -07:00
b803b4ce09 [torch.distributed.rpc] Add stringify WorkerInfo, better error message for py_rref (#39974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974

# Problem

When this assertion happens, I don't know
- which worker_id it is on, even with the worker_name "trainer:0".
- which rref is throwing this exception.

```shell
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp>
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception
    raise result.exception_type(result.msg)
RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0')
Traceback (most recent call last):
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0
```

Changes:
- Add stringify WorkerInfo
- Make localValue() assertion message clearer about the case.
ghstack-source-id: 105840918

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner

buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork

Reviewed By: mrshenli

Differential Revision: D5690653

fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170
2020-06-13 12:57:05 -07:00
905c6730b7 Adding /FS for NVCC if /Zi is used (#39994)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39989.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39994

Differential Revision: D22034956

Pulled By: malfet

fbshipit-source-id: b26cf188eba8b796ee6e39e6adbc3e2fbb07a53a
2020-06-13 12:16:12 -07:00
e2825392b6 Update torchvision commit from Mar 11 to Jun 11 2020 (#39970)
Summary:
The Mar 11 version of TorchVision still has some Python 2 anachronisms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39970

Differential Revision: D22034738

Pulled By: malfet

fbshipit-source-id: aa281d50072e2448a6b202061f3ae8e8b65346ad
2020-06-13 10:40:51 -07:00
bdef721caf [fbcode] Add find_method into lite interpreter python binding.
Summary: Add 'find_method' into 'LiteScriptModule' python binding methods, so that we can use it to check for the existence of methods, e.g. "get_all_bundled_inputs".

Reviewed By: linbinyu, houseroad

Differential Revision: D22029002

fbshipit-source-id: 9acf76880fc989e825dc3a9186dab6928caee75e
2020-06-13 07:48:13 -07:00
ddd45ae919 Extend int8 FC op to take scale and zero point from input
Summary: Extend int8 FC op to take scale and zero point from input to support int8 PTQ productization of online training models.

Test Plan: buck test caffe2/caffe2/quantization/server:fully_connected_dnnlowp_op_test

Reviewed By: csummersea

Differential Revision: D21944884

fbshipit-source-id: 2094827da903f3993afe4f8cf6e70286b195321d
2020-06-13 02:34:45 -07:00
8d3fcb43cf Revert D22008317: [Reland] Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h
Test Plan: revert-hammer

Differential Revision:
D22008317

Original commit changeset: b25714d8643c

fbshipit-source-id: 01f9f0a996cecba1fb2e7c0eb51ba32ea1fcf73e
2020-06-12 22:39:10 -07:00
db2b273d1f Reland: Fix CUDA device guard usage when first arg of kernel is scalar (#39956)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/39870

Closes https://github.com/pytorch/pytorch/issues/38889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39956

Differential Revision: D22027956

Pulled By: ngimel

fbshipit-source-id: e6029f450e2da3782b2d05bcc2012c19b82291da
2020-06-12 21:41:53 -07:00
e62d655744 [Reland] Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h (#39881)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39881

Test Plan: Imported from OSS

Differential Revision: D22008317

Pulled By: pbelevich

fbshipit-source-id: b25714d8643cf584bb3331d70e44f4df06c1b615
2020-06-12 19:15:13 -07:00
34d1098dc2 [rpc] fix RRef alias annotation (#39933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39933

Fix the RRef-related alias annotation to ensure it's not getting erased by
the JIT DCE pass.

Test Plan: Imported from OSS

Differential Revision: D22015426

Pulled By: wanchaol

fbshipit-source-id: 3e74d49fa9f88abaf662bde7be5284f01f621b98
2020-06-12 17:17:48 -07:00
356d564886 [rpc] use annotation_str for RRef type serialization (#39932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39932

This PR makes RRef fork use the jit type annotation_str recently introduced in
https://github.com/pytorch/pytorch/pull/39544 to allow a consistent
serialization type str format, and fixes the case where the dict->str() format
does not match the type resolver.

Test Plan: Imported from OSS

Differential Revision: D22015427

Pulled By: wanchaol

fbshipit-source-id: f64d7e3acde5312813816c8f3c7d8fa9379704e8
2020-06-12 17:15:57 -07:00
d9539cd835 [testing] Dont use zipfile for storage __reduce__ (#39893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39893

ghstack-source-id: 105721911

Test Plan: waitforsadcastle

Differential Revision: D22003583

fbshipit-source-id: 864c3e36fc79412334ab3887c9776eaaabc5a315
2020-06-12 16:48:12 -07:00
727e77a809 [quant] Enable reduce_range for graphmode (#39874)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39874

When the fbgemm backend is set, we make sure reduce_range is set to true to avoid overflow in the operator.
Also adds a test for per-channel quant with graph mode and compares numerics with eager mode.

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D22011205

fbshipit-source-id: 1c7c9b7ab0d84200e3d8d85c34978554c30c0169
2020-06-12 16:25:58 -07:00
b2620722c3 Kill meanall from TH, THC (#39907)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39907

Differential Revision: D22022995

Pulled By: ngimel

fbshipit-source-id: 0d9f3b61c28af7a8b1296fb9588b07a708b347f7
2020-06-12 15:16:50 -07:00
8749aaca83 Abort setup_pytorch_env.bat if one of the steps failed (#39951)
Summary:
It's pointless to continue if `conda install`, Visual Studio or `pip install` commands have failed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39951

Differential Revision: D22026240

Pulled By: malfet

fbshipit-source-id: de982e9d328e3fd7d9f0bd14400c0116b3010281
2020-06-12 14:32:57 -07:00
8bc821f0d0 Revert D21976891: [futures] Add torch.futures.collect_all()/wait_all() python api.
Test Plan: revert-hammer

Differential Revision:
D21976891

Original commit changeset: 253c61f503f4

fbshipit-source-id: f839b16f4469e96325b607b6313a1397e1988856
2020-06-12 13:40:37 -07:00
14099374bd Update TensorPipe submodule (#39945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39945

In order to pick up 8fb1fe66f8.

Test Plan: Export to CircleCI and make sure tests pass.

Reviewed By: patricklabatut

Differential Revision: D22019033

fbshipit-source-id: eb192ea3950e4f27ed222f84e2d9de8bf6eb927c
2020-06-12 12:57:53 -07:00
a9aa6367c2 [futures] Add torch.futures.collect_all()/wait_all() python api. (#39790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39790

The "[fut.wait() for fut in futs]" idiom can introduce up to
O(len(futs)) thread switches, which may be excessive for large N.

This plumbs through the new c++ c10::collectAll() to Python space
so that we only employ a single jit-side wait.
ghstack-source-id: 105779443

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn

Reviewed By: kiukchung

Differential Revision: D21976891

fbshipit-source-id: 253c61f503f4ffb9be784e6c49a0656cede139fb
2020-06-12 12:36:04 -07:00
0d19ae5a14 [pytorch] fix (ProfiledType|TraceType)None.cpp (#39934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39934

Shouldn't write a .cpp file when it's called to produce a header file.

Test Plan: Imported from OSS

Differential Revision: D22016596

Pulled By: ljk53

fbshipit-source-id: 30a1b4a527bc1ffd8ee748c70494fe712be60c4f
2020-06-12 11:53:01 -07:00
fdf6d37895 re-enable some corner cases in memory format transpose test (#39891)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39891

Test Plan: Imported from OSS

Differential Revision: D22009141

Pulled By: glaringlee

fbshipit-source-id: aaf24fdf080855bbe9d2fe5082aa8e92c6973f34
2020-06-12 11:46:57 -07:00
558c20f50a Int8 PTQ ops for online training (#39818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39818

Add histogram collection and qparam update support for Int8 PTQ during online training
Add caffe2 wrappers for generating int8 quant params based on output activation samples from the LastNWindowCollector op.

Test Plan:
```
buck test mode/opt caffe2/caffe2/quantization/server:int8_gen_quant_params_test
```

Reviewed By: hx89

Differential Revision: D21984455

fbshipit-source-id: 9479c87a5b1867aec662ecd21fe7ad2bc7e8652c
2020-06-12 11:41:30 -07:00
99084104b6 [quant][graphmode][refactor] isScalar check (#39892)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39892

Test Plan: Imported from OSS

Differential Revision: D22009856

fbshipit-source-id: fbc407499bcff0f25e44eedba3d6cd1225325c24
2020-06-12 10:53:35 -07:00
1e05e5e0ae Correct #39759 for HIP. (#39801)
Summary:
Changes in PR https://github.com/pytorch/pytorch/issues/39759 broke HIP caffe2.
hipify for caffe2 renames CUDA to HIP; torch does not.
If caffe2 calls into torch, it needs to use CUDA-named functions.

CC ezyang xw285cornell sunway513 houseroad dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39801

Differential Revision: D21982493

Pulled By: xw285cornell

fbshipit-source-id: 8e88e0fb80c71f0342e23ef0214a42d5542bdc70
2020-06-12 10:34:28 -07:00
f3f9415f81 Add file_name argument to load_state_dict_from_url (#39749)
Summary:
Add the feature proposed here https://github.com/pytorch/pytorch/issues/39196
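
A hedged usage sketch (the URL is a placeholder):

```
import torch

state_dict = torch.hub.load_state_dict_from_url(
    "https://example.com/resnet18.pth",  # placeholder URL
    file_name="resnet18_custom.pth",     # new kwarg: overrides the cached file name
    progress=False,
)
```
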
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39749

Differential Revision: D21962736

Pulled By: ailzhang

fbshipit-source-id: b60fb0d83fd0728354a46e2762cc3598b14b1fdb
2020-06-12 10:31:22 -07:00
baa604812c add optional request headers to torch.hub (#39740)
Summary:
Closes https://github.com/pytorch/pytorch/issues/39657.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39740

Differential Revision: D22005273

Pulled By: ailzhang

fbshipit-source-id: eb33147386fb4befb0278c788111306aa878ca39
2020-06-12 10:25:26 -07:00
d367f575b9 [CI][vulkan] android build abi x86 with USE_VULKAN (#39912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39912

Reland of https://github.com/pytorch/pytorch/pull/39767
What was wrong:
The android_x86_32_vulkan job used the same docker image as android_x86_32.
As a result, the vulkan job committed its docker image, and the following android_gradle job used a libpytorch.so built with USE_VULKAN, while the vulkan wrapper was not added to the linking of libpytorch_jni.

Fix: commit to different docker images:
```
            elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-vulkan-x86_32"* ]]; then
              export COMMIT_DOCKER_IMAGE=$output_image-android-vulkan-x86_32
```

Test Plan: Imported from OSS

Differential Revision: D22012951

Pulled By: IvanKobzarev

fbshipit-source-id: 27908f630e6ce3613679a50b4c10f8b246718894
2020-06-12 10:04:24 -07:00
2bab9149cc Extend int8 quantize op to take scale and zero point from input
Summary: Extend int8 quantize op to take scale and zero point from input to support int8 PTQ productization of online training models.

Test Plan: buck test caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test

Reviewed By: csummersea

Differential Revision: D21939660

fbshipit-source-id: 7ce2fbf9cd8a990c270f2187a49b1578ce76bc37
2020-06-12 09:28:51 -07:00
48678aa39f pin ninja version to fix windows CI (#39944)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39944

Differential Revision: D22019435

Pulled By: albanD

fbshipit-source-id: 9d0933066526b3e3fc4896457241f0877099291b
2020-06-12 08:56:45 -07:00
4574abc395 Replace __host__ __device__ with C10_HOST_DEVICE in THCIntegerDivider.cuh (#39797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39797

This change allows using [OffsetCalculator.cuh](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cuda/detail/OffsetCalculator.cuh) on both CPU and CUDA.

It is required by (and manually tested with) https://github.com/pytorch/csprng/pull/28

Test Plan: Imported from OSS

Differential Revision: D22002571

Pulled By: pbelevich

fbshipit-source-id: 0b85e465777f6613f3b3ba91873da57dab949c54
2020-06-12 08:50:21 -07:00
124cdf2290 Add experimental deterministic flag (#38683)
Summary:
Adds a `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of PyTorch.
Adds `torch.experimental.deterministic_error_level` to allow users to choose between error/warning/silent if determinism for an operation is not available.
Adds `torch.experimental.alert_not_deterministic()`, which should be called within operations that are not deterministic.
Offers both Python and ATen interfaces

Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683

Differential Revision: D21998093

Pulled By: ezyang

fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737
2020-06-12 08:44:06 -07:00
004aa089a6 [jit][subgraph_rewriter] Support list of filters (#39867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39867

Support a list of filters in the subgraph rewriter; the rewrite will execute only
when the match passes all filter checks. This is useful when different matches
share the same filter.

Test Plan: Imported from OSS

Differential Revision: D22009855

fbshipit-source-id: 67aab8d6326b2011a9061397699dc62ee9ad4e2d
2020-06-12 08:24:49 -07:00
3876889218 Remove LegacyComplex.h (#39834)
Summary:
All std::complex usage has been migrated to c10::complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39834

Differential Revision: D22001969

Pulled By: ezyang

fbshipit-source-id: 665a9198afde45a95309053b2f2381e123bf869a
2020-06-12 08:18:25 -07:00
ae6a68ad09 [TensorPipe] Add extensive logging (#39781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39781

Use a new feature of TensorPipe where a pipe can tell you the name of the remote endpoint, in order to make the logging messages more informative: whenever there is a failure on a pipe, say which worker this was to/from, and the ID of the message involved.

Also, add plenty of verbose logging, to help with debugging. This is off by default, but can be enabled by setting the `GLOG_v` env var to a value of 1 or higher.
ghstack-source-id: 105777704

Test Plan: Builds.

Differential Revision: D21973150

fbshipit-source-id: 9e3ce1b9977e1e9ecd91ff4a6fe82786dc79a702
2020-06-12 07:11:09 -07:00
52cc0c2c37 Revert D22011184: [pytorch][PR] Fix CUDA device guard usage when first arg of kernel is scalar
Test Plan: revert-hammer

Differential Revision:
D22011184

Original commit changeset: 427291c456e8

fbshipit-source-id: 7d4979e98bbd9294b91da255ecfc063615741630
2020-06-12 06:46:11 -07:00
246d7bb41d [quant][graphmode] Quantizing traced modules (#39826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39826

Expanding operator test coverage to traced modules

Test Plan: Imported from OSS

Differential Revision: D21991266

fbshipit-source-id: 73b1d94caa6ad41bb0d6cbde7ba0de343da3e7ff
2020-06-12 00:55:11 -07:00
c068233300 Add CHECK-SOURCE-HIGHLIGHTED to file check utils. (#39692)
Summary:
Enhance FileCheck util to check for highlighted source ranges. This is useful when writing tests regarding generated error messages that require source code highlighting.

Here is how the error looks in different cases:

- In case the expected source code token is not found at all in the input string:
```
RuntimeError: Expected to find "invalid_token" but did not find it
Searched string:

...  <--- HERE
def to_list_missing_type_annotation(x):
    # type: (torch.Tensor) -> List[float]
From CHECK-SOURCE-HIGHLIGHTED: invalid_token
```

- In case the source code token is found but not highlighted:
```
Traceback (most recent call last):
  File "test_range.py", line 11, in <module>
    FileCheck().check_source_highlighted("x.tolist()").run(s)
RuntimeError: Expected to find "~~~~~~~~~~" but did not find it
Searched string:
    # type: (torch.Tensor) -> List[float]
    li = x.tolist()
         ~~~~~~~~~ <--- HERE
         ~~~~~~~~~~~~~~~~~~~...  <--- HERE
    return li
```

It is a bit confusing that both the input text (usually an error message) and the generated error message have their own highlighted portions, but this is consistent with previous behavior. Another option is to generate plain error messages without additional range highlighting on the input text.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39692

Test Plan:
Added unit test.

Closes https://github.com/pytorch/pytorch/issues/38698

Differential Revision: D22001765

Pulled By: gmagogsfm

fbshipit-source-id: 6681441eee5853ab061d198ccfe55ebffddca202
2020-06-11 23:47:07 -07:00
0526af1af6 [vulkan] Conv2d with optional clamp (#39115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39115

Conv2d with optional clamp, to handle fused conv2d_clamp from prepacking

Test Plan: Imported from OSS

Differential Revision: D21962427

Pulled By: IvanKobzarev

fbshipit-source-id: 8aec7ae22dff6ed3896011ebc218292b5503a69b
2020-06-11 23:43:34 -07:00
71372b452a [vulkan] addmm support non-vulkan inputs (#39078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39078

Adding support for non-Vulkan inputs to the addmm operator:
if an input is not on Vulkan, it is converted inside the operator.

If we run a pretrained torchscript model, the weights of the linear op will be on CPU; we need this to run mobilenetV2 on the Vulkan backend.

Test Plan: Imported from OSS

Differential Revision: D21962425

Pulled By: IvanKobzarev

fbshipit-source-id: 8222edd31dfb14b326d15e6fec5c8778783479df
2020-06-11 23:41:12 -07:00
80e5ebf989 [nvFuser] Transform replay refactor and minor updates (#39579)
Summary:
We've got quite a few things going on, preparing a push back to upstream so we don't get too desynced.

- Major refactor of transform replay. It is now far more robust and fixes bugs discovered in reductions. Preparing for extension to explicit broadcast ops which will be the last major memory pattern for op coverage. Broadcast ops will allow us to express up to and potentially beyond norms and gemms.

- Initial runtime expression evaluator. This allows us to evaluate expressions at runtime. Will be useful for determining our grid/block layout at runtime, so we don't have to manually compute them according to the code we're trying to generate.

- Moving to int64 and double for scalar representations to match PyTorch JIT.

- Improvements in the codegen interface where we return a Tensor-like object instead of the parent class Val.

- Add `addcmul` and `lerp` ops

- General updates, fixes, test additions, test improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39579

Differential Revision: D21974001

Pulled By: soumith

fbshipit-source-id: 7f7ccc91593466e948f3ce90f8f9b7fbc5c28de2
2020-06-11 23:04:24 -07:00
2854811ab8 [JIT] Allow self-referential annotations in classes (#39821)
Summary:
**Summary**
This commit adds support for annotations in method signatures of
TorchScript class types that refer to the class being defined.

**Test Plan**
This commit adds a unit test to check that a method that uses
self-referential type annotations can be defined and produces the same
results in Python and TorchScript.
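
A minimal sketch of the pattern this enables (class and method names are illustrative):
```python
import torch

@torch.jit.script
class Counter(object):
    def __init__(self, value: int):
        self.value = value

    def add(self, other: 'Counter') -> 'Counter':
        # The annotations refer to Counter, the class being defined.
        return Counter(self.value + other.value)

c = Counter(1).add(Counter(2))
```
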
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39821

Differential Revision: D22003624

Pulled By: SplitInfinity

fbshipit-source-id: dce921c2e0ca0c8aecb52d5b0646b419eb207146
2020-06-11 22:11:27 -07:00
14e841c292 [quant][graphmode] Remove dedup pass (#39825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39825

Removing the pass for now since it is causing error for some models

Test Plan: Imported from OSS

Differential Revision: D21987878

fbshipit-source-id: 129aefb34754d5390a4c9d3108fa1b6c2eae5a74
2020-06-11 21:46:24 -07:00
2cd27be5b5 Fix CUDA device guard usage when first arg of kernel is scalar (#39870)
Summary:
Add an OptionalDeviceGuard for second arg in gpu_kernel_with_scalars when first arg is scalar

Closes https://github.com/pytorch/pytorch/issues/38889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39870

Differential Revision: D22011184

Pulled By: ngimel

fbshipit-source-id: 427291c456e879f25d15ab76a60b5d4ad61f3b3f
2020-06-11 20:08:43 -07:00
b10c53e9b8 Vectorize on output for reduction kernels (#37206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37206

Benchmark on P100: https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark-vectorize-output.ipynb

```python
import torch
print(torch.__version__)
print()

for i in range(1000):
    torch.arange(10000, device='cuda')

def benchmark(dtype, i):
    size0 = 2 ** (i // 2)
    size1 = 2 ** ((i + 1) // 2)
    a = torch.zeros(size0, size1, device='cuda', dtype=dtype)
    torch.cuda.synchronize()
    %timeit a.sum(dtype=dtype, dim=0); torch.cuda.synchronize()

for dtype in [torch.int8, torch.half, torch.float, torch.double]:
    print(dtype)
    for i in range(18, 30):
        benchmark(dtype, i)
    print()
```
Before
```
1.5.0a0+3bbb36e

torch.int8
24.5 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24.1 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
39 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
59.6 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
111 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
186 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
397 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
665 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.45 ms ± 837 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.03 ms ± 2.79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float16
24.2 µs ± 66.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24.6 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.2 µs ± 53.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
32 µs ± 91 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.1 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
66.9 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
121 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
218 µs ± 384 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
431 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
854 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.75 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.63 ms ± 849 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float32
24.2 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24.4 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.3 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.5 µs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
57.4 µs ± 44.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.5 µs ± 41.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
288 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
557 µs ± 904 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1e+03 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.98 ms ± 533 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.8 ms ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float64
25 µs ± 54.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
37.1 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
54.3 µs ± 45.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84.9 µs ± 65.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
139 µs ± 68.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
275 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
504 µs ± 702 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
987 µs ± 613 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.84 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.64 ms ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.19 ms ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After
```
1.5.0a0+3bbb36e

torch.int8
29.8 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.7 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
33.4 µs ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
32.5 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.6 µs ± 94.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
53.7 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
68 µs ± 69.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
98.2 µs ± 88.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
283 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
522 µs ± 563 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
967 µs ± 495 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

torch.float16
29.4 µs ± 68.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.2 µs ± 45.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.8 µs ± 41 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.3 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.1 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
70.4 µs ± 67.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
101 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
157 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
275 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
486 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
936 µs ± 211 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.85 ms ± 124 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

torch.float32
29.9 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.5 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
33 µs ± 93.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64 µs ± 73.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
99.4 µs ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
157 µs ± 74.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
265 µs ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
490 µs ± 319 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
960 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.84 ms ± 632 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.6 ms ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float64
33.1 µs ± 74.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
36.7 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.7 µs ± 39.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
61.6 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
100 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
270 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
491 µs ± 445 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
939 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.88 ms ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.65 ms ± 5.18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.3 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Test Plan: Imported from OSS

Differential Revision: D21233255

Pulled By: ngimel

fbshipit-source-id: d468fddbb228c0c13146dfc6344c470513f9e374
2020-06-11 19:44:17 -07:00
a92231b70e Typo in Dispatch.h (#39882)
Summary:
std::complex is gone; all dispatch macros now use c10::complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39882

Differential Revision: D22009933

Pulled By: pbelevich

fbshipit-source-id: 613ac06d0024f149184d0b2e08ed06d7d6066017
2020-06-11 19:35:50 -07:00
bbf364b0c1 move basic math ops to lite interpreter (#39861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39861

Move some basic math ops to the lite interpreter.
The size change should be small.

Test Plan: build

Reviewed By: iseeyuan

Differential Revision: D21992552

fbshipit-source-id: 7f5a7380ffc1519001a98169e6c5381e45e8e0ea
2020-06-11 18:55:58 -07:00
bdecedd2d7 [JIT] use python type resolver for all types (#39880)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/39269, use the python resolver for all type resolutions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39880

Reviewed By: jamesr66a

Differential Revision: D22008110

Pulled By: eellison

fbshipit-source-id: 4e62eb0867d79e7d45156f88d628a28fa1578b9e
2020-06-11 18:36:31 -07:00
0aa70039f9 Delete redundant device/dtype in TensorIterator add_input/add_output (#39798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39798

add_input's device/dtype are 100% redundant, as compute_types will
always (internally) assert that this dtype matches the expected dtype.
add_output's device/dtype is redundant UNLESS you have an undefined
tensor (in which case it seems to be an indication of what the output type
should be). The one add_output case I killed can never be exercised, see:

```
import torch
x = torch.randn(3, 4)
mask = x.ge(0.5)
torch.masked_select(x.cuda(), mask.cuda(), out=torch.zeros((0), dtype=torch.int64, device='cuda'))
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21981742

Pulled By: ezyang

fbshipit-source-id: a042d1b9fce0ad58b833856ffe32001787551e59
2020-06-11 17:32:34 -07:00
f59e38974a fix multinomial for empty batch (#39873)
Summary:
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39873

Reviewed By: ailzhang

Differential Revision: D22004830

Pulled By: ngimel

fbshipit-source-id: 0274cd2ee40e84f06b34e7b53329e95d05a9ddd4
2020-06-11 17:26:39 -07:00
8c8d9f8971 Move pip install after setting up VS environment (#39898)
Summary:
This is needed, because pip might want to build ninja from source
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39898

Differential Revision: D22010548

Pulled By: malfet

fbshipit-source-id: 55423324c381aaec8a3c81f95f9405dd618b4e49
2020-06-11 17:15:03 -07:00
63dc1363e6 [TensorExpr] Eliminate Cond statements when each branch is a different kind of empty (#39754)
Summary:
Fix another simplification edge case: a Cond statement where one branch is nullptr and the other is a zero-stmt block. This happens mostly with an if with no else branch where all statements inside the if are removed (e.g. via inlining or simplification). A common case is SplitWithMask -> ComputeInline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39754

Differential Revision: D21962987

Pulled By: nickgg

fbshipit-source-id: 2461415466fbbab88d2329061f90fcfdfa85e243
2020-06-11 17:08:14 -07:00
36501ff5d9 [vulkan] VulkanTensor, add strides in interface (#39077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39077

We plan to support strides for Vulkan, but that is not implemented yet.
The main intention of faking strides and is_contiguous is to be able to run torchscript mobilenetV2 on the Vulkan backend for development and profiling.

This change adds strides to Vulkan interface and overrides strides(), stride(), is_contiguous() of OpaqueTensorImpl for that purpose.

Test Plan: Imported from OSS

Differential Revision: D21962426

Pulled By: IvanKobzarev

fbshipit-source-id: cfef4903ad7062170926264f45cff1293ade78f6
2020-06-11 16:38:06 -07:00
eace053398 Move all torch.nn.modules type annotations inline (#38211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38211

Just because the annotations are inline doesn't mean the files type
check; most of the newly annotated files have type errors and I
added exclusions for them in mypy.ini.  The payoff of moving
all of these modules inline is that I can delete the relevant code
generation logic for the pyi files (which was adding ignore
annotations that weren't actually relevant anymore).

For the most part the translation was completely mechanical, but there
were two hairy issues.  First, I needed to work around a Python 3.6 and
earlier bug where Generic has a nontrivial metaclass.  This fix is in
torch/jit/__init__.py.  Second, module.py, we need to apply the same
fix for avoiding contravariance checks that the pyi file used to have;
this is done by declaring forward as a variable (rather than a
function), which appears to be sufficient enough to get mypy to not
contravariantly check input arguments.

Because we aren't actually typechecking these modules in most
cases, it is inevitable that some of these type annotations are wrong.
I slavishly copied the old annotations from the pyi files unless there
was an obvious correction I could make.  These annotations will probably
need fixing up later.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21497397

Pulled By: ezyang

fbshipit-source-id: 2b08bacc152c48f074e7edc4ee5dce1b77d83702
2020-06-11 15:59:57 -07:00
e22dd561ad Migrate pow kernel to c10::complex (#39286)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39286

Test Plan: Imported from OSS

Differential Revision: D21999454

Pulled By: pbelevich

fbshipit-source-id: c8f1ba4ff4ec66ffbc283700cabb6794e6b2896a
2020-06-11 15:49:30 -07:00
b5848833f0 Add runtime check for MSVC redist (#39841)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39734.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39841

Differential Revision: D21998020

Pulled By: ezyang

fbshipit-source-id: 77df537045e4d7e718ab34e35bb6f847638f4b01
2020-06-11 15:37:21 -07:00
9ca7fdcef0 Attempt to fix macos ci by pinning numba (#39875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39875

Numba released a new version (0.50) that is causing problems with
librosa (we use this as a test dependency). Try pinning the version of
numba to temporarily fix. I am not actually sure if this is going to
work because it is unclear when we actually install numba.

Test Plan: - wait for CI.

Reviewed By: mruberry

Differential Revision: D22005838

Pulled By: zou3519

fbshipit-source-id: 4bccfa622c82533d85631052e4ad273617ea31d7
2020-06-11 15:25:54 -07:00
0b90b9cdd3 Allow shuffle when auto-batching disabled in DataLoader (#39865)
Summary:
Fix https://github.com/pytorch/pytorch/issues/35761
cc SsnL

Note: closed the other PR for this new branch.
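
A minimal sketch of the combination this allows (the dataset is illustrative):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(6))
# batch_size=None disables auto-batching; shuffle=True is now permitted
# alongside it, yielding individual samples in random order.
loader = DataLoader(ds, batch_size=None, shuffle=True)
for sample in loader:
    print(sample)
```
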
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39865

Differential Revision: D22003612

Pulled By: ezyang

fbshipit-source-id: 26aecd1b298fe99d3924f4c8157cd6cae2561c7c
2020-06-11 15:17:46 -07:00
ae3567427f .circleci: Remove upload_binary_sizes job (#39786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39786

This wasn't really used since we already have an internal SCUBA table to
handle this use case and it doesn't rely on a singular script to run
after all binaries have been uploaded.

Also, the web page took an enormously long time to actually load,
decreasing its usefulness. Let's just get rid of the job altogether
instead of trying to fix something no one really looked at.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22007197

Pulled By: seemethere

fbshipit-source-id: d824b576e07c9cf1603db5ac14940b06ecdd2a0e
2020-06-11 15:00:12 -07:00
32bf63890b Revert D21992267: [pytorch][PR] [ONNX] Export linspace
Test Plan: revert-hammer

Differential Revision:
D21992267

Original commit changeset: 3a4093703570

fbshipit-source-id: 09c8cddd8cac3bb1cfa2b5f1abf2af3c45d8a3b1
2020-06-11 14:46:02 -07:00
8893c0670d Revert D21511611: Wrap Caffe2 (RowWise)SparseAdagrad fusion operator as a PT op
Test Plan: revert-hammer

Differential Revision:
D21511611

Original commit changeset: 1a0bb8252ec0

fbshipit-source-id: 462cb20585cf3ad415c80f0f1d77e5fdba2a6ea1
2020-06-11 14:40:18 -07:00
91d539097b [ONNX] Fix regression disabling checker (#39073)
Summary:
Fix a regression in disabling the checker. The checker should be enabled for the ONNX export type only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39073

Reviewed By: hl475

Differential Revision: D21992276

Pulled By: houseroad

fbshipit-source-id: 79c671fc4af9e6d28e8957e04ae205f42f4bb38a
2020-06-11 14:03:18 -07:00
85f1f67f33 Wrap Caffe2 (RowWise)SparseAdagrad fusion operator as a PT op (#38704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38704

This diff wraps Caffe2's (RowWise)SparseAdagrad fusion operator on GPU as a PT op.

Reviewed By: jianyuh, xw285cornell

Differential Revision: D21511611

fbshipit-source-id: 1a0bb8252ec0a8229eb80708338cb23008cfb26d
2020-06-11 13:56:02 -07:00
01986e9890 Wait for all op types in SimpleNet (#39493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39493

Make sure we wait for all types, incl. async cpu ops

Test Plan: CI

Reviewed By: kennyhorror

Differential Revision: D21873540

fbshipit-source-id: 37875cade68e1b3323086833f8d4db79362a68e8
2020-06-11 13:00:34 -07:00
7a792879f2 Prevent clobbering of docker images by parallelnative/paralleltbb builds (#39863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39863

The paralleltbb and parallelnative builds use the same docker image as
pytorch-linux-trusty-py3.6-gcc5.4-build
(da3073e9b1/.circleci/config.yml (L6752))

Therefore they should push to a different intermediate docker image for
the next phase (testing), according to
(da3073e9b1/.circleci/config.yml (L434-L439)).

However, they're not actually included in that list.

We've found evidence of what looks like clobbering in recent CI jobs
(https://circleci.com/gh/pytorch/pytorch/5787534?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link),
(https://circleci.com/gh/pytorch/pytorch/5787763?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link)

This PR adds parallelnative and paralleltbb to the list to prevent
clobbering.

Test Plan:
- Wait for CI tests to pass on this PR.
- The paralleltbb and parallelnative builds don't actually run on PRs.
So I think the plan here is to yolo land and hope it works.

Differential Revision: D22002279

Pulled By: zou3519

fbshipit-source-id: 89f05cedb2fbde2dc3033458b1e0c5be0b194955
2020-06-11 12:48:32 -07:00
2b29feace4 [TensorExpr] Fix IRPrinter for function calls with uniqued names (#39753)
Summary:
IRPrinter was using the name_hint rather than the uniqued name when printing FunctionCalls, leading to output that appeared incorrect.

E.g. for the following set of tensorexprs:
```
producer[M, N] = M * N;
chunk[M, N/2] = producer[M, N];
chunk_1[M, N/2] = producer[M, N + N/2];
consumer[M, N] = chunk_1[M, N];
```

Before fix:
```
{
  for (int m = 0; m < 4; m++) {
    for (int n = 0; n < 20; n++) {
      producer[m, n] = m * n;
    }
  }
  for (int m_1 = 0; m_1 < 4; m_1++) {
    for (int n_1 = 0; n_1 < 10; n_1++) {
      chunk[m_1, n_1] = producer(m_1, n_1);
    }
  }
  for (int m_2 = 0; m_2 < 4; m_2++) {
    for (int n_2 = 0; n_2 < 10; n_2++) {
      chunk_1[m_2, n_2] = producer(m_2, n_2 + 10);
    }
  }
  for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 10; j++) {
      consumer[i, j] = i * (chunk(i, j));          <----- HERE!
    }
  }
}
```

After fix:
```
{
  for (int m = 0; m < 4; m++) {
    for (int n = 0; n < 20; n++) {
      producer[m, n] = m * n;
    }
  }
  for (int m_1 = 0; m_1 < 4; m_1++) {
    for (int n_1 = 0; n_1 < 10; n_1++) {
      chunk[m_1, n_1] = producer(m_1, n_1);
    }
  }
  for (int m_2 = 0; m_2 < 4; m_2++) {
    for (int n_2 = 0; n_2 < 10; n_2++) {
      chunk_1[m_2, n_2] = producer(m_2, n_2 + 10);
    }
  }
  for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 10; j++) {
      consumer[i, j] = i * (chunk_1(i, j));          <----- HERE!
    }
  }
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39753

Differential Revision: D21962441

Pulled By: nickgg

fbshipit-source-id: caa429caf0df9c7b16e109937412587bff6dc886
2020-06-11 12:13:28 -07:00
7957d83498 [ONNX] Export linspace (#39403)
Summary:
Adding linspace symbolic for opset 11
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39403

Reviewed By: hl475

Differential Revision: D21992267

Pulled By: houseroad

fbshipit-source-id: 3a40937035703754045bb5e22ac5edf721453c25
2020-06-11 12:08:19 -07:00
4c7d81f847 Add documentation for properties in TensorIterator. (#39792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39792

I also deleted the dead TensorIterator::remove_dimension,
and reordered some properties so they were more logically
grouped.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21981739

Pulled By: ezyang

fbshipit-source-id: e7c9ad0233284f7c47322e62035edb704640aafd
2020-06-11 12:04:17 -07:00
f33c31eace Separate "configuration" properties in TensorIterator (#39789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39789

Some properties on TensorIterator are only set prior to build() by the
user and then immutable during the build process.  I've renamed all such
properties so that they have a config_ prefix, gave them an explicit
accessor and audited every site to ensure they are only written once.

I also renamed check_mem_overlaps to compute_mem_overlaps to avoid
confusion with the accessor check_mem_overlap.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21981741

Pulled By: ezyang

fbshipit-source-id: b64e33a5d0bc01834ead6d7082605c20a5ed1a08
2020-06-11 12:01:32 -07:00
1f027ac02d Disable testTHCAllocator on HIP (#39843)
Summary:
THCAllocator functionality is pretty obscure and it's hard to get it working with HIP because of how Caffe2/PyTorch rules are set up (see https://github.com/pytorch/pytorch/issues/39801). Let's just disable the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39843

Reviewed By: zou3519

Differential Revision: D21998687

Pulled By: dzhulgakov

fbshipit-source-id: cd12ba30cdfee658b98393ed3a72e83f4ecf1c9c
2020-06-11 11:36:17 -07:00
ad91a3a11f Skipping L2 regularization on sparse biases
Summary:
# Motivations
As explained in the [link](https://stats.stackexchange.com/questions/86991/reason-for-not-shrinking-the-bias-intercept-term-in-regression/161689#161689), regularizing biases will cause mis-calibration of predicted probabilities.
In SparseNN, the unary processor may use 1d embedding tables for the sparse features to serve as biases.
In this diff, the regularization term is automatically skipped for the 1d sparse parameters to avoid regularizing biases.

# Experiments
Experiments were conducted to verify that it has no significant impact on the NE to skip the regularization on 1d sparse parameters.
Baseline.1 (no L2 regularization): f193105372
Baseline.2 (L2 regularization in prod): f193105522
Treatment (skipping L2 regularization on 1d sparse params): f193105708

{F239859690}

Test Plan:
Experiments were conducted to verify that it has no significant impact on the NE to skip the regularization on 1d sparse parameters using a canary package: `aml.dper2.canary:9efc576b35b24361bb600dcbf94d31ea`.

Baseline.1 (no L2 regularization): f193105372
Baseline.2 (L2 regularization in prod): f193105522
Treatment (skipping L2 regularization on 1d sparse params): f193105708

Reviewed By: zhongyx12

Differential Revision: D21757902

fbshipit-source-id: ced126e1eab270669b9981c9ecc287dfc9dee995
2020-06-11 11:21:25 -07:00
da3073e9b1 Revert D21960728: Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h
Test Plan: revert-hammer

Differential Revision:
D21960728

Original commit changeset: dad5da902b97

fbshipit-source-id: ca1757409cb3ffa7e069699276dc0363f6879f97
2020-06-11 08:29:57 -07:00
aaaf2eb6b3 Add batching rule for torch.sum(tensor, dims) (#39581)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39581

Context: Batching rules
------------------------------------

Batching rules take BatchedTensors and regular Tensors as arguments. A
batching rule generally does the following:

1. Converts (logical) BatchedTensors to views on physical tensors.
2. Converts logical arguments (e.g. dimension indexes, shapes) to
physical arguments that correspond to the physical tensors.
3. Calls at:: operations on the physical tensors and arguments.
4. Converts physical results back to BatchedTensors.

Steps 1 and 2 differ for operators with different batching behaviors.
(see next section)

VmapTransform abstraction
------------------------------------
(Previously known as a "Converter". Bikeshedding welcome on the naming).

An ArgTransform converts logical views of tensors to physical views. When
writing a batching rule, users should select the ArgTransform that matches
the batching behavior of their operator. If the batching behavior of the
op is complicated, then they’ll have to write some custom logic (either
by writing a new ArgTransform, or writing the logical->physical transform
themselves).

*56% (~474) of (vmap-supported) operators can and will use these
VmapTransforms. 20% (~168) of operators need custom handling*.

See `VmapTransforms.h` for more context.

PhysicalView
------------------------------------

VmapTransforms return physical views on tensors, represented by the
PhysicalView struct. It is effectively a Tensor and contains
enough metadata to enable mapping logical non-tensor arguments to
physical non-tensor arguments, and the other way around.

There are two methods on PhysicalView right now:
- `PhysicalView::getPhysicalDim(logical_dim)` and
`PhysicalView::getPhysicalDims(logical_dims)`
are used to map logical dims to physical dims.
- `PhysicalView::newLogicalFromPhysical(Tensor)` is used to map a result
physical tensor from a batching rule to a logical tensor
(BatchedTensor).

Test Plan:
------------------------------------
- `./build/bin/vmap_test`

Differential Revision: D21983789

Pulled By: zou3519

fbshipit-source-id: dc558e05b596fd29f9643e933e4ece4b7866b6db
2020-06-11 08:00:12 -07:00
2d1cf950bb Impose maximum level restriction for BatchedTensors (#39580)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39580

We support 64 total levels. This is done so that we can represent lists
of levels as a bitset that fits into a single `int64_t` and is a
reasonable upper bound because we only support (physical) tensors of up
to 64 dimensions with vmap (see kVmapMaxTensorDims).
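
A small Python sketch of the bitset idea (the helper name is illustrative; the real implementation is in C++):
```python
# Represent a set of vmap levels as bits of a single 64-bit integer.
def add_level(bitset: int, level: int) -> int:
    assert 0 <= level < 64, "only 64 total levels are supported"
    return bitset | (1 << level)

levels = add_level(add_level(0, 0), 3)  # levels {0, 3} -> 0b1001
```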

Test Plan:
`./build/bin/vmap_test`. One day we'll test this with the vmap Python
API.

Differential Revision: D21929249

Pulled By: zou3519

fbshipit-source-id: 2e99c0c519d6ab0c063fda20f4a0b1f53da6d450
2020-06-11 07:58:35 -07:00
1360bb986c Revert D21976091: [vulkan][CI] CI android build abi x86 with USE_VULKAN
Test Plan: revert-hammer

Differential Revision:
D21976091

Original commit changeset: cb9fa5612cfe

fbshipit-source-id: 00f41068c6b2c62473ba9c4026e0122c331b54bf
2020-06-11 07:24:31 -07:00
ba27fd04d3 Fixes type promotion for cat (#39777)
Summary:
Fixes a bug introduced in  https://github.com/pytorch/pytorch/issues/35030

Changes the test to do all the possible type combinations.
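
For reference, the promotion behavior under test looks like this:
```python
import torch

a = torch.tensor([1, 2], dtype=torch.int32)
b = torch.tensor([1.0, 2.0], dtype=torch.float64)
# cat promotes its inputs to a common dtype.
print(torch.cat([a, b]).dtype)  # torch.float64
```
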
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39777

Differential Revision: D21975165

Pulled By: albanD

fbshipit-source-id: 6d59cfac4f1abe021f8b489454c1c176e7893ecd
2020-06-11 05:52:56 -07:00
c3d4053bc0 [quant][graphmode] Support quantized::batch_norm2d_relu fusion for tracing (#39645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39645

This PR added quantization support for handling BatchNorm2d and ReLU (or F.relu) in both
scripting and tracing.
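
A minimal sketch of the pattern this pass now handles (the module is illustrative):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BNReLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.bn = nn.BatchNorm2d(3)

    def forward(self, x):
        # BatchNorm2d followed by relu is the pattern fused into
        # quantized::batch_norm2d_relu during graph mode quantization.
        return F.relu(self.bn(x))
```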

Test Plan:
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_qbatchnorm_relu

Imported from OSS

Differential Revision: D21942111

fbshipit-source-id: 680e16076a37b96d2485d5cbc39ce9a045c319c3
2020-06-10 23:32:59 -07:00
e1392922f2 [quant] Enable per-channel quantization for LSTM Modules (#39666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39666

Test Plan:
python test/test_quantization.py TestPostTrainingDynamic.test_per_channel_lstm_quantize

Imported from OSS

Differential Revision: D21977601

fbshipit-source-id: 1333259e75782e54864ab444e05397b86cd9b9aa
2020-06-10 23:19:08 -07:00
425927bb2b [quant] Add reduce_range params for quantized_lstm (#39604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39604

This change preserves BC for older models that are saved with reduce_range set to false.
Newer models will use the version information in RNN module to toggle reduce_range parameter

Internally this is implemented using a new CellParams type that calls the linear functions with reduce_range option set to true.
New models serialized will use the CellParams struct for the `__getstate__` and `__setstate__` calls. Older models using QuantizedCellParamsDynamic will continue to use their original serialization/de-serialization methods

tested using LSTM BC test and test_quantized_rnn

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21977600

fbshipit-source-id: 0cb0e098b87207b537574d3beeab1f341c41c0d2
2020-06-10 23:16:57 -07:00
e399e470b6 [vulkan] speed_benchmark_torch add vulkan arg to use Vulkan backend (#39076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39076

Adds a `--vulkan` argument to run the torch benchmark on the Vulkan backend.
If it is true, inputs will be converted to the Vulkan backend before module.forward.

Usage for mobilenetv2 fp32:
```
/build/bin/speed_benchmark_torch --model=mn-fp32.pt --input_type=float --input_dims=1,3,224,224 --warmup=1 --iter=5 --vulkan=true
```

Test Plan: Imported from OSS

Differential Revision: D21962428

Pulled By: IvanKobzarev

fbshipit-source-id: 3136af5386b6bce9ea53ba4a9019af2d312544b3
2020-06-10 22:19:22 -07:00
a752832da9 Fix Tensor.tolist signature in the docstring (#39732)
Summary:
- Before

    ![image](https://user-images.githubusercontent.com/6421097/84171548-8b4e1300-aa40-11ea-90e2-d75a672979d7.png)

- After

    ![image](https://user-images.githubusercontent.com/6421097/84171953-01527a00-aa41-11ea-933c-a6b7a02100ea.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39732

Differential Revision: D21984956

Pulled By: ngimel

fbshipit-source-id: fa02776cbe0aa27d4da21a34aae191b491199c28
2020-06-10 22:00:14 -07:00
5d2f6d86e5 graph mode: add quantization type enum (#39795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39795

Replaces the `is_dynamic` bool with enums in the Python and C++
graph quantization code.  This makes the code more readable
and will make it easier to modify when adding QAT logic in the future.
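
A hedged sketch of the shape of the change (names and values are assumed for illustration; the actual enum lives in the graph mode quantization code):
```python
import enum

class QuantType(enum.IntEnum):
    # Replaces a bare is_dynamic boolean and leaves room for QAT later.
    DYNAMIC = 0
    STATIC = 1
```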

Test Plan:
CI, as well as
```
python test/test_quantization.py TestQuantizeDynamicScript
python test/test_quantization.py TestQuantizeScriptJitPasses
```

Imported from OSS

Differential Revision: D21981643

fbshipit-source-id: d475760407bcc794aeae92a2c696bac4acda843d
2020-06-10 21:34:23 -07:00
94dfc76e3f graph mode qat: make fake_quantize scriptable (#39750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39750

Add a test to make the default QAT qconfig scriptable, and fix
all the errors.

Test Plan:
```
python test/test_quantization.py TestQATScript.fake_quant_scriptable
```

Imported from OSS

Differential Revision: D21975879

fbshipit-source-id: 8c48ad9f24b2c941d2267cb53eb70ebecd103744
2020-06-10 21:34:18 -07:00
5c10b17491 graph mode: more docs for insert observers pass (#39739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39739

Adding more docs and examples to make code reading easier for newcomers.

Test Plan:
CI, no logic changes

Imported from OSS

Differential Revision: D21975878

fbshipit-source-id: 464858c0490cfbdec165a5ddf3817ca4878abb09
2020-06-10 21:34:13 -07:00
f8561acb13 graph mode: add docs to pre-calibration passes (#39683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39683

Adding a couple of docstrings to `_jit_pass_dedup_module_uses` and
`_jit_pass_insert_observers`.

Test Plan:
CI, no logic change

Imported from OSS

Differential Revision: D21975880

fbshipit-source-id: 8876e0e981d6675bce08fa8e08ac7a3d38c3c622
2020-06-10 21:32:25 -07:00
4c4b9916ef Include AT_PARALLEL_OPENMP/AT_PARALLEL_NATIVE/AT_PARALLEL_NATIVE_TBB to ATen/Config.h (#39612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39612

Fixes #39471

Manually tested with https://github.com/pytorch/csprng

Test Plan: Imported from OSS

Reviewed By: ezyang, ljk53

Differential Revision: D21960728

Pulled By: pbelevich

fbshipit-source-id: dad5da902b97a080753482ec5c293eee9bba89c8
2020-06-10 21:04:44 -07:00
97dfdaaad8 torch.multinomial : fast-path for replacement=False (#39742)
Summary:
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import time
import torch
import numpy as np

for n, t in [(500_000, 10),
             (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)

print('****' * 10)

for n, t in [(50_000, 100),
             (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```

Before:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```

After:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 81.09321522712708
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.06062650680541992
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.0862889289855957
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.85304307937622
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13271093368530273
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17215657234191895
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.035035133361816406
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.03631949424743652
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.05507040023803711
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.05105161666870117
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.05449223518371582
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.09161853790283203
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742

Differential Revision: D21976915

Pulled By: ngimel

fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306
2020-06-10 20:42:55 -07:00
780fa2b489 Switch torch.save to zipfile serialization and swap quantization to that (#39460)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39460

Test Plan: Imported from OSS

Differential Revision: D21865748

Pulled By: jamesr66a

fbshipit-source-id: 90fddf366fcb3030e09ed79fb3e038f0175875a5
2020-06-10 17:19:55 -07:00
262dbdf0a5 [caffe2/nomnigraph] handle when PATH env is not defined (#39373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39373

Line 114 is the only actual change. Other changes are just formatting.

Test Plan: CI

Reviewed By: zrphercule

Differential Revision: D21830893

fbshipit-source-id: 83e49b1b3c48f6bc6de3c48ccce60c84aa49339b
2020-06-10 17:09:59 -07:00
96870181c6 Remove duplicated entries in random.rst (#39725)
Summary:
In the current master doc, every function under [`torch.random`](https://pytorch.org/docs/master/random.html) appears twice because the function docs are generated by both `automodule` and `autofunction`.

This PR removes the parts generated by `autofunction`.

See changed docs at https://5751500-65600975-gh.circle-artifacts.com/0/docs/random.html:

![image](https://user-images.githubusercontent.com/6421097/84165823-bf720580-aa39-11ea-9149-c428d43260f8.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39725

Differential Revision: D21983701

Pulled By: ngimel

fbshipit-source-id: 5f515d7fd8034687e754e2c7b2ea9e154b3ea9b9
2020-06-10 16:51:15 -07:00
be838504a3 Remove THTensor_(fill) & THTensor_(zero) (#39727)
Summary:
Remove `THTensor_(fill)` & `THTensor_(zero)`, following PR https://github.com/pytorch/pytorch/pull/39042 as a reference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39727

Differential Revision: D21980423

Pulled By: ngimel

fbshipit-source-id: bfeeaebe93d9ff465c7ad21c872cad8f8399537d
2020-06-10 15:06:04 -07:00
bfa76ff407 [Doc] Clarify that variance estimator is biased for normalization layers (#39752)
Summary:
Closes https://github.com/pytorch/pytorch/issues/39330
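
For context, the distinction being documented is the standard biased vs. unbiased variance estimator:
```python
import torch

x = torch.randn(100)
biased = x.var(unbiased=False)   # divides by N, as the norm layers do
unbiased = x.var(unbiased=True)  # divides by N - 1 (the default)
```
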
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39752

Differential Revision: D21980097

Pulled By: ngimel

fbshipit-source-id: 2bdcb8bf8194768985f5a8787712d215c0c5c1ec
2020-06-10 14:44:44 -07:00
95489b590f Throws runtime error when performing integer division using torch.div (#38620)
Summary:
**1.6 Deprecation Note**

In PyTorch 1.6 attempting to divide two integer tensors or an integer tensor and an integer scalar will throw a runtime error. This behavior was deprecated with a warning in PyTorch 1.5. In PyTorch 1.7 torch.div and the division operator will always perform true division like Python3 and NumPy.

To divide integer values use either torch.true_divide, for true division, or torch.floor_divide (the // operator) for floor division.
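
A short illustration of the replacement APIs named above:
```python
import torch

a = torch.tensor([4, 7])
b = torch.tensor([2, 2])

# torch.div(a, b)  # raises a RuntimeError in 1.6 for two integer tensors
print(torch.true_divide(a, b))   # tensor([2.0000, 3.5000])
print(torch.floor_divide(a, b))  # tensor([2, 3]), same as a // b
```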

**PR Summary**

This PR updates the warning issued when performing integer division to be a runtime error. Because some serialized TorchScript programs may rely on torch.div's historic behavior, it also implements a "versioned symbol" for div that lets those models retain their current behavior. Extensive tests of this behavior make up the majority of this PR.

Note this change bumps the produced file format version to delineate which programs should have their historic div behavior preserved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38620

Differential Revision: D21612598

Pulled By: mruberry

fbshipit-source-id: c9c33591abce2f7e97f67f0f859901f5b03ed47d
2020-06-10 13:59:34 -07:00
cb519801d6 [vulkan][CI] CI android build abi x86 with USE_VULKAN (#39767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39767

Adding an android build for every PR with `-DUSE_VULKAN=ON`. It will use vulkan from the ANDROID_NDK, so no changes to docker images are needed.

Test Plan: Imported from OSS

Differential Revision: D21976091

Pulled By: IvanKobzarev

fbshipit-source-id: cb9fa5612cfebc02dfd4946e50faa121311780f7
2020-06-10 13:55:47 -07:00
a5fbd3ef8a [vulkan][build_fix] Fix Vulkan Build; Prepacking uses new register api (#39771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39771

Vulkan build was not integrated with CI, it fails without this change.
There were 2 separate problems
1.  Recently added aten/src/ATen/templates/Functions.cpp missed VulkanType in header

2. Applying the new registration api, similar to xnnpack change
https://github.com/pytorch/pytorch/pull/36800

Test Plan:
`ANDROID_ABI=x86 ./scripts/build_android.sh -DUSE_VULKAN=ON` builds ok

CI integration for it is in the next PR in this stack ( https://github.com/pytorch/pytorch/pull/39767 )
job `ci/circleci: pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_vulkan_build`

Differential Revision: D21975992

Pulled By: IvanKobzarev

fbshipit-source-id: b0400a9cb0ae90d7763ebeb5b8f7ee932a2148e1
2020-06-10 13:54:12 -07:00
7f55197a57 Peel Loop (#39434)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39434

Differential Revision: D21857037

Pulled By: Krovatkin

fbshipit-source-id: 6583da167fe93d96e93f1c3d71f46f94e7f4e982
2020-06-10 13:48:18 -07:00
a1071e5d36 Fix parsing of subscript expressions using python resolver (#39269)
Summary:
- add call out to python resolver in parseArgsFromDecl, parserReturnFromDecl
- add support in python resolver for nested subexpressions
- wrap python resolver call in exception handling to fall back to c++ path
- add tests for newly resolvable types
- closes https://github.com/pytorch/pytorch/issues/38728

Fixes bug where SourceRange objects did not include the final closing ']' for a subscript expression.  E.g. range for 'List[int]' previously included only 'List[int'.
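
For reference, a declaration whose annotations exercise this subscript parsing path:
```python
import torch
from typing import List

@torch.jit.script
def total(xs):
    # type: (List[int]) -> int
    t = 0
    for x in xs:
        t = t + x
    return t
```
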
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39269

Differential Revision: D21956402

Pulled By: wconstab

fbshipit-source-id: 5d783260322eb1e04e20bc931a8e9d9179765f13
2020-06-10 13:30:15 -07:00
6748fbd38a Remove MultiheadAttention weights from constants (#39768)
Summary:
The weights of the `MultiheadAttention` were incorrectly listed as constants, which produced warnings when converting to a TorchScript module.

```py
import torch
import torch.nn as nn

multihead_attn = nn.MultiheadAttention(256, 4)

torch.jit.script(multihead_attn)
```

Warnings:

```
/home/michael/.local/lib/python3.8/site-packages/torch/jit/_recursive.py:151: UserWarning: 'q_proj_weight' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  warnings.warn("'{}' was found in ScriptModule constants, "
/home/michael/.local/lib/python3.8/site-packages/torch/jit/_recursive.py:151: UserWarning: 'k_proj_weight' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  warnings.warn("'{}' was found in ScriptModule constants, "
/home/michael/.local/lib/python3.8/site-packages/torch/jit/_recursive.py:151: UserWarning: 'v_proj_weight' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  warnings.warn("'{}' was found in ScriptModule constants, "
/home/michael/.local/lib/python3.8/site-packages/torch/jit/_recursive.py:151: UserWarning: 'in_proj_weight' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  warnings.warn("'{}' was found in ScriptModule constants, "
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39768

Reviewed By: zhangguanheng66

Differential Revision: D21977032

Pulled By: ngimel

fbshipit-source-id: c2c3d0605a51324a9541f5a2caca7ab7a518dc00
2020-06-10 13:23:48 -07:00
3fb1e73a4e Add rpc.async_execution support for rpc.remote on script functions (#39758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39758

Test Plan: Imported from OSS

Differential Revision: D21963789

Pulled By: mrshenli

fbshipit-source-id: f16f464ba01401b160cc4d3daf036e4bc806d7ea
2020-06-10 13:17:07 -07:00
eb7843ed01 [quantization] Remove duplicated piece of code in test (just a nit). (#39770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39770

Remove duplicated piece of code in test (just a nit).

Test Plan: buck test test:quantization

Reviewed By: supriyar

Differential Revision: D21967877

fbshipit-source-id: 48a2d60e108fb9ddfa98e30888cf45744905277d
2020-06-10 12:51:42 -07:00
d04a3fcc42 Refactor CUDA bernoulli_kernel by using uniform_and_transform (#39652)
Summary:
- Fixes https://github.com/pytorch/pytorch/issues/39557.
- Related: https://github.com/pytorch/pytorch/issues/38558.
- Simplified `void bernoulli_kernel(TensorIterator& iter, double p_, RNG gen)` in `cuda/DistributionTemplates.h` by using `uniform_and_transform`.
- Unified `void bernoulli_kernel(TensorIterator& iter, double p_, RNG gen)` with other kernels in `cuda/DistributionTemplates.h`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39652

Differential Revision: D21974529

Pulled By: pbelevich

fbshipit-source-id: 5bbc06350714f4e72dc6ea8a0016769551610a52
2020-06-10 12:49:59 -07:00
68b8740611 Update TensorPipe submodule (#39783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39783

This is needed to pick up the new pipe method used in https://github.com/pytorch/pytorch/pull/39781.

Test Plan: CircleCI

Reviewed By: patricklabatut

Differential Revision: D21974131

fbshipit-source-id: 4b74064279ad4881cbd95e408423566a1cd62c2a
2020-06-10 12:41:32 -07:00
c22bbb2124 [JIT] Add Type::repr_str to return human-readable str (#39544)
Summary:
Clearly expressing that a type was inferred by PyTorch rather than explicitly annotated by the user makes many error messages more user-friendly.

Currently Type has two string conversion methods: str() for IR printing and python_str() for serialization and error message generation. If we want to include more information in type printing while maintaining serialization/deserialization correctness, we need to split python_str() into annotation_str() and repr_str().

annotation_str() is solely responsible for serialization; it strictly matches the format of Python type annotations. repr_str() is responsible for generating a human-readable error message that includes information like "this type is inferred, not explicitly annotated".

Closes https://github.com/pytorch/pytorch/issues/39449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39544

Differential Revision: D21978759

Pulled By: gmagogsfm

fbshipit-source-id: 733566f5a62e748b5ca4bb3c5943ebb6d5b664d0
2020-06-10 12:01:24 -07:00
4e892bd99c [Easy Review] Fix ProcessGroupRpcBackendOptions Doc (#39787)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39787

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D21975590

Pulled By: mrshenli

fbshipit-source-id: b0f9c3b15fdd32df8f64ea64b05fd65ec793f2b2
2020-06-10 11:44:33 -07:00
7994d6e147 [quant][graphmode] Support quantization for aten::append (#39644)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39644

Test Plan: Imported from OSS

Differential Revision: D21942112

fbshipit-source-id: 8dc5871cbde9e9cc161a624c2b07e2e74bc0ff6d
2020-06-10 11:29:27 -07:00
08105a0068 Remove unnecessary !op.is_read_write test in compute_names/compute_shape. (#39747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39747

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21961336

Pulled By: ezyang

fbshipit-source-id: 6c7b3ccebd8f95a04994d53e5b5e9471bfefb26b
2020-06-10 10:48:59 -07:00
b1750cb884 always resize_ min/max outputs (#39696)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/36474. Addresses the comment in https://github.com/pytorch/pytorch/issues/35591, and makes the behavior with the `out` kwarg consistent with other functions that `resize_` the out tensor they are passed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39696

Differential Revision: D21947473

Pulled By: ngimel

fbshipit-source-id: 072f2bd156cc55b87ed64cf47c44b527e5e0b82c
2020-06-10 10:09:05 -07:00
9ba2530d42 [ROCm] explicitly embed version within image name (#39735)
Summary:
This commit also removes the clang7 install for ROCm images, and properly cleans up the apt cache after ROCm installation to reduce image sizes.

Embedding the ROCm version within the image name follows the precedent set by the CUDA images and decouples image creation from ROCm's implicit installation of the latest version when images are prepared.

CC sunway513 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39735

Differential Revision: D21976162

Pulled By: ezyang

fbshipit-source-id: 9801106e8cb118a812113ec077154e72a9c2eb2d
2020-06-10 10:01:10 -07:00
2d589bc9da [quant][graphmode] Fix a corner case in handling if in insert_observers (#39615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39615

corner case: https://github.com/pytorch/vision/blob/master/torchvision/models/densenet.py#L87

(Note: this ignores all push blocking failures!)

Test Plan: Imported from OSS

Differential Revision: D21942110

fbshipit-source-id: 9522032957575662d2648db2a41fa5410e8d1e3a
2020-06-10 09:45:07 -07:00
0aecbbb762 Changes TensorIterator computation to not consider out kwarg, lets UnaryOps safe cast to out (#39655)
Summary:
**BC breaking note:**

In PyTorch 1.5 passing the out= kwarg to some functions, like torch.add, could affect the computation. That is,

```
out = torch.add(a, b)
```

could produce a different tensor than

```
torch.add(a, b, out=out)
```

This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed.
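
A minimal sketch of the 1.6 behavior (dtypes chosen purely for illustration):

```python
import torch

a = torch.tensor([1, 2], dtype=torch.int64)
b = torch.tensor([0.25, 0.5], dtype=torch.float32)

# The computation dtype comes only from the inputs
# (int64 + float32 -> float32); out's dtype no longer changes it.
out = torch.empty(2, dtype=torch.float64)
torch.add(a, b, out=out)  # computed in float32, then cast to float64

# Writing into out is still gated by canCast: a floating-point result
# cannot be safely cast into an integral out tensor.
bad_out = torch.empty(2, dtype=torch.int64)
# torch.add(a, b, out=bad_out)  # raises RuntimeError in 1.6
```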

**ORIGINAL PR NOTE**

This PR effectively rewrites Tensor Iterator's "compute_types" function to both clarify its behavior and change how our type promotion works to never consider the out argument when determining the iterator's "common dtype," AKA its "computation type." That is,

```
a = op(b, c)
```

should always produce the same result as

```
op(b, c, out=a)
```

This is consistent with NumPy and programming languages like Python and C++.

The conceptual model for this change is that a TensorIterator may have a "common computation type" that all inputs are cast to and its computation performed in. This common computation type, if it exists, is determined by applying our type promotion rules to the inputs.

A common computation type is natural for some classes of functions, like many binary elementwise functions (e.g. add, sub, mul, div...). (NumPy describes these as "universal functions.") Many functions, however, like indexing operations, don't have a natural common computation type. In the future we'll likely want to support setting the TensorIterator's common computation type explicitly to enable "floating ufuncs" like the sin function that promote integer types to the default scalar type. Logic like that is beyond the type promotion system, which can only review inputs.

Implementing this change in a readable and maintainable manner was challenging because compute_types() has had many small modifications from many authors over ~2 year period, and the existing logic was in some places outdated and in other places unnecessarily complicated. The existing "strategies" approach also painted with a broad brush, and two of them no longer made conceptual sense after this change. As a result, the new version of this function has a small set of flags to control its behavior. This has the positive effect of disentangling checks like all operands having the same device and their having the same dtype.

Additional changes in this PR:

- Unary operations now support out arguments with different dtypes. Like binary ops they check canCast(computation type, out dtype).
- The dtype checking for lerp was outdated and its error message included the wrong variable. It has been fixed.
- The check for whether all tensors are on the same device has been separated from other checks. TensorIterators used by copy disable this check.
- As a result of this change, the output dtype can be computed when only the input types are available.
- The "fast path" for checking if a common dtype computation is necessary has been updated and simplified to also handle zero-dim tensors.
- A couple helper functions for compute_types() have been inlined to improve readability.
- The confusingly named and no longer used promote_gpu_output_dtypes_ has been removed. This variable was intended to support casting fp16 reductions on GPU, but it has become a nullop. That logic is now implemented here: 856215509d/aten/src/ATen/native/ReduceOpsUtils.h (L207).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39655

Differential Revision: D21970878

Pulled By: mruberry

fbshipit-source-id: 5e6354c78240877ab5d6b1f7cfb351bd89049012
2020-06-10 09:04:13 -07:00
acc13ac828 [PyTorch] Make DDP reducer work under distributed autograd (#37998)
Summary:
## Why doesn’t DDP work under dist_autograd?
DDP follows the steps below
1. [DDP Python constructor](8d6a8d2b3f/torch/nn/parallel/distributed.py (L389-L393)) (on a module) creates a [C++ Reducer](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp), which holds references to all parameters (or variables in C++ code).
2. The reducer installs a post hook on each model parameter.
3. The backward run starts and triggers the post hooks installed above.
4. The post hook of a parameter simply marks the parameter ready for all-reduce.
5. Once all parameters in a bucket are ready, an all-reduce process starts by reading variable `.grad` and writes to variable `.grad`.

But under dist_autograd, `.grad` of a variable is not populated at all. Instead, grads live in a global map, held by the distributed context, from variables to their grads.

## Solution of this PR
The distributed engine sets a thread_local variable in a backward run indicating we're running in distributed mode. The DDP reducer can then appropriately use `.grad` or the distributed context based on that thread local. More precisely, the thread local is set before calling the post hooks installed by the DDP reducer so that the DDP post hooks can retrieve it.
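
As a rough usage sketch of the pattern this enables (assuming process groups and RPC are already initialized):

```python
import torch
import torch.distributed.autograd as dist_autograd
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(torch.nn.Linear(10, 10))
with dist_autograd.context() as context_id:
    loss = model(torch.randn(4, 10)).sum()
    dist_autograd.backward(context_id, [loss])
    # Gradients live in the distributed context rather than in
    # param.grad; the reducer's post hooks now read them from there.
    grads = dist_autograd.get_gradients(context_id)
```
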
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37998

Test Plan:
```
python test/distributed/test_ddp_under_dist_autograd.py
```

FB repo
```
buck test caffe2/test/distributed/...
```

DDP accuracy benchmark workflow run

```
flow-cli canary pytorch.benchmark.accuracy_comparison.workflow --parameters-json '{"node_world_size": 4, "dist_backend": "nccl"}' --run-as-secure-group fblearner_flow --entitlement gpu_prod
```

f196173157

Reviewed By: pritamdamania87

Differential Revision: D21513795

Pulled By: hczhu

fbshipit-source-id: fe21e68ecdc9274182db4d4bb5a1e2d68ef927a2
2020-06-10 08:38:14 -07:00
7cb4eae8b1 correct some cpp extension code usages and documents (#39766)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39766

Test Plan: Imported from OSS

Differential Revision: D21967284

Pulled By: glaringlee

fbshipit-source-id: 8597916bee247cb5f8c82ed8297119d2f3a72170
2020-06-10 08:31:22 -07:00
be13388adb Migrate AT_DISPATCH_COMPLEX_TYPES to c10::complex (#39564)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39564

Differential Revision: D21954741

Pulled By: anjali411

fbshipit-source-id: 1280b5a11732b31a240f8bc129cc09298e61e419
2020-06-10 06:33:53 -07:00
9111ae7782 Preserve user specified attributes and methods (#38830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38830

This patch makes it possible to preserve user-specified attributes or non-forward
methods. The API:
  _freeze_module(Module, ["a", "version"])
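
A hedged Python-side sketch of the same call via the `torch._C._freeze_module` binding (the exact binding signature is an assumption here):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = 1
        self.version = "1.0"

    def forward(self, x):
        return x + self.a

m = torch.jit.script(M())
# Keep 'a' and 'version' as attributes on the frozen module instead of
# folding or removing them during freezing.
frozen = torch._C._freeze_module(m._c, ["a", "version"])
```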

Test Plan: Imported from OSS

Differential Revision: D21957316

Pulled By: bzinodev

fbshipit-source-id: 5c9146ae679791070a9de868c45785725b48a9e6
2020-06-10 01:38:18 -07:00
6bdfd6ae1a [TensorExpr] Fast sigmoid for LLVM (#39717)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39717

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D21949849

Pulled By: zheng-xq

fbshipit-source-id: f918bb2cb0ea647ce254fc51258af6fd01325f2d
2020-06-09 20:11:35 -07:00
307920731d [iOS] Add nonVarTypeModeGuard to fix the unit test (#39743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39743

### Summary

Still need this RAII guard for full JIT

### Test Plan

- CI checks

Test Plan: Imported from OSS

Differential Revision: D21968256

Pulled By: xta0

fbshipit-source-id: 8ea63c699fed4e2a01390232a58f039110391844
2020-06-09 20:05:32 -07:00
2193fa119e [JIT] consider side effects when trying moves in alias analysis (#39497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39497

Previously, we didn't consider side effects at all when moving nodes in alias analysis. It is never valid to reorder a node with a side effect. This has led to bugs when used with Bailouts.

Unfortunately this might cause regressions, but the prior behavior wasn't correct :/

Test Plan: Imported from OSS

Differential Revision: D21963774

Pulled By: eellison

fbshipit-source-id: 656995d1b82534eca65437ed4e397b2bf08a4dec
2020-06-09 19:32:55 -07:00
3cf9b7d9ea move mm_cpu from BlasWrappersCPU.cpp to LinearAlgebra.cpp and delete the former file (#39700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39700

Refactored files
1. Moved mm_cpu from BlasWrappersCPU.cpp to LinearAlgebra.cpp
2. Deleted BlasWrappersCPU.cpp

These functions are closely related to those in LinearAlgebra.cpp, so we don't need a separate file.
ghstack-source-id: 105503249

Test Plan:
`buck build //caffe2/aten/...`
`buck build //xplat/caffe2:ptmobile_benchmarkAndroid#android-armv7`
CI

Reviewed By: dreiss

Differential Revision: D21692154

fbshipit-source-id: 4edb7cee53c9e29700372f16ca1e6f85539dac24
2020-06-09 19:15:02 -07:00
1b99be9088 Freezing Module containing fork subgraphs (#37044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37044

This patch applies freezing to modules containing fork subgraphs.

Test Plan: Imported from OSS

Differential Revision: D21956334

Pulled By: bzinodev

fbshipit-source-id: ec272ddea1ed588c35d8ffa4ea9b532d5336667f
2020-06-09 19:10:38 -07:00
0fe1ec3ce0 [quant][graphmode] Test weight observer for dynamic quant (#39687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39687

Run the observer on the weight values and compare it with the calculated attributes in the graph

Test Plan:
python test/test_quantization.py test_dynamic_weight_observer

Imported from OSS

Differential Revision: D21961907

fbshipit-source-id: dde3e629b8514e6c82346915ac35e35cf9c05f6f
2020-06-09 17:41:57 -07:00
2a06a6935c [quant][graphmode] Support propagate dequantize for nodes with multiple outputs (#39551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39551

e.g. prim::ListUnpack

Test Plan: Imported from OSS

Differential Revision: D21942108

fbshipit-source-id: 6fd7c972ca70692ec52c296b6a1e858324e66c12
2020-06-09 17:31:16 -07:00
e46060701d [caffe2] Fix of initializing ATen's CUDA before using caching allocator (#39759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39759

Caffe2 has a mode where it uses PT's caching allocator. Somehow we were not calling the initialization explicitly.

Now, I have no idea why it worked before. It's probably worth running a bisect separately.

Reviewed By: houseroad

Differential Revision: D21962331

fbshipit-source-id: f16ad6b27a67dbe0bda93939cca8c94620d22a09
2020-06-09 17:25:42 -07:00
56289ba31f [JIT] Improve error message when type annotation Future without a contained type (#39751)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39751

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D21960368

Pulled By: jamesr66a

fbshipit-source-id: 8650d31ff8070b12672c8d4b0224d4e69f619938
2020-06-09 16:55:13 -07:00
4c5a808d37 avoid dividing by 0 in div unit test (#39736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39736

In some rare cases we can end up generating a random number equal to 0.

Test Plan: test_div

Reviewed By: yinghai

Differential Revision: D21953973

fbshipit-source-id: a834f624d72f1084c300163344662df121aae21b
2020-06-09 16:39:19 -07:00
be3bbfc917 [futures] Add collectAny() to ivalue::Future for completeness (#39597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39597

To complement collectAll(), this change adds collectAny(), and writes
up relevant unittest coverage.

We also remove the vector-based helper version of collectAll(), introduced
in a previous change, which was of debatable usefulness.
ghstack-source-id: 105527180

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit/...

Differential Revision: D21910311

fbshipit-source-id: dbb3ca404672a3d751b1b3cf016e6084a9ff8040
2020-06-09 16:32:52 -07:00
bccf8831b8 Allow initializing TestCase() outside of unittest.main() (#39695)
Summary:
When debugging it is sometimes useful to call test code manually.  This change makes that easier.

Before this change, one would get the following error:
```
$ python -c "from torch.testing._internal.jit_utils import JitTestCase; JitTestCase()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/jansel/pytorch/torch/testing/_internal/common_utils.py", line 740, in __init__
    test_method = getattr(self, method_name)
AttributeError: 'JitTestCase' object has no attribute 'runTest'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39695

Test Plan: `python -c "from torch.testing._internal.jit_utils import JitTestCase; JitTestCase()"`

Differential Revision: D21959249

Pulled By: jansel

fbshipit-source-id: 8435249f102338c957c3a7a7aad48d21d372a8cf
2020-06-09 15:59:36 -07:00
c902146ba4 [quant][graphmode][refactor] propagateQuantizationOps (#39550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39550

This prepares for the next PR, which fixes propagating dequantize for ops with multiple outputs.

Test Plan: Imported from OSS

Differential Revision: D21942063

fbshipit-source-id: 518b3e437140bec9620988d2eb59b6aae069245e
2020-06-09 15:48:20 -07:00
428bc90978 [JIT] add dtype as type annotation (#39741)
Summary:
Make torch.dtype resolve as a type annotation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39741

Reviewed By: jamesr66a

Differential Revision: D21956469

Pulled By: eellison

fbshipit-source-id: 492acd7403fa827a2e2c87fd08d31450fcb3a45e
2020-06-09 15:01:00 -07:00
4e30146368 Use ProgramFiles environment variable on Windows (#39707)
Summary:
'Program Files' does not have to be on disk C (nor does it necessarily
have to be called `Program Files`).
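
A minimal sketch of the pattern (fallback value illustrative):

```python
import os

# Resolve the Program Files directory from the environment instead of
# hard-coding C:\Program Files.
program_files = os.environ.get("ProgramFiles", r"C:\Program Files")
```
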
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39707

Differential Revision: D21954235

Pulled By: malfet

fbshipit-source-id: 91a9b765cd1bc7e6201dd4b800d45257207010d9
2020-06-09 14:55:52 -07:00
0f39ed86a7 Cleanup debug info switches with MSVC (#39703)
Summary:
Switch off `/Z7` so that we don't generate debug info in Release and MinSizeRel builds, which should give us smaller static libraries and object files and faster build times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39703

Differential Revision: D21960684

Pulled By: ezyang

fbshipit-source-id: 909a237a138183591d667885b13fc311470eed65
2020-06-09 14:11:40 -07:00
f1e6e56641 Add aarch64 ci badge (#39698)
Summary:
This PR added a third-party aarch64 CI badge. It currently covers CPU-only builds of the PyTorch master branch on Python 3.6 and Ubuntu 18.04. This CI is provided by OpenLab[1].

The build job runs once every day at 0000 UTC.

You can preview the badge here[2]

The build failed because of a known issue: https://github.com/pytorch/pytorch/issues/33124

More python version and GPU support will be added in the future.

This fixes pytorch/pytorch#39558.

1: https://openlabtesting.org/
2: https://github.com/wangxiyuan/pytorch/tree/add_aarch64_ci_badge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39698

Differential Revision: D21960607

Pulled By: ezyang

fbshipit-source-id: 15d5c06e455ed1b5cf69c3b33906c098cb539f87
2020-06-09 14:02:59 -07:00
b84a7fbbc1 Fix error message in autograd (#39729)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39729

Differential Revision: D21953429

Pulled By: albanD

fbshipit-source-id: 76aea69c5100371daaee7e5e386aac05e0b6a438
2020-06-09 13:54:49 -07:00
3bdbb27ddb Fix Gather::apply accessing moved tensors (#39733)
Summary:
Somehow this gets uncovered when I make changes in https://github.com/pytorch/pytorch/pull/39710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39733

Differential Revision: D21956034

Pulled By: albanD

fbshipit-source-id: e6d9097d67a4d0951ae6ccd2c5fb4cd54c536960
2020-06-09 13:38:26 -07:00
f31aca3a11 Cleanup cuda install scripts for Windows jobs (#39712)
Summary:
1. Separate script
2. Don't print the result to stdout
3. Don't collect logs if installation succeeds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39712

Differential Revision: D21959449

Pulled By: malfet

fbshipit-source-id: 3379de28db0606632587a598c6721ff54f1e85b7
2020-06-09 13:28:54 -07:00
2633a9cca1 Adding LpNorm regularization for sparse features in DPER3 (#38582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38582

Adding LpNorm regularization for sparse features in DPER3.  This is done using a sparse regularization op with run_after_optimizer (see D21003029).

* Added code calling new caffe2 operator from D21003029 to caffe2/python/regularizer.py
* Added l1norm and l2norm to sparse regularizer thrift definition.
* Added the new regularization references to test utils.
* Added a new file for unit tests "sparse_nn_sparse_reg_test.py"

Test Plan:
buck test mode/dev //caffe2/caffe2/fb/dper/layer_models/tests:sparse_nn_sparse_reg_test
buck test mode/dev //caffe2/caffe2/fb/dper/layer_models/tests:sparse_nn_reg_test

DPER canary: https://fburl.com/fblearner/rcp5yzeh
New DPER canary: https://fburl.com/fblearner/0krgd74x

Differential Revision: D20704248

fbshipit-source-id: 7e3d5013b3ff3da95ea027f0f2dd855f3ae8e41d
2020-06-09 12:34:50 -07:00
f1c60c04b8 [JIT] Fix module interface test (#39592)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39592

Test Plan: Imported from OSS

Differential Revision: D21909659

Pulled By: jamesr66a

fbshipit-source-id: 831ae6b158041d4241209cee50f7a4d09cd2fcb2
2020-06-09 12:13:58 -07:00
3413f0a8ca Fix dll load failure in virtual environments on Windows (#39622)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39620.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39622

Differential Revision: D21953420

Pulled By: malfet

fbshipit-source-id: ab0e0358327ec321130384e0a654987cd70349c0
2020-06-09 11:28:22 -07:00
18073ffca3 Add tests for mismatched dtypes in torch.gather. (#39689)
Summary:
https://github.com/pytorch/pytorch/pull/38646 added checks for this, but only added tests for the scatter functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39689

Reviewed By: malfet

Differential Revision: D21945524

Pulled By: gchanan

fbshipit-source-id: 8b06856c06d6427b8cd929a1275422a5ed6e11cc
2020-06-09 08:05:40 -07:00
8565ae5a76 Revert D21925406: [pytorch][PR] torch.multinomial : fast-path for replacement=False
Test Plan: revert-hammer

Differential Revision:
D21925406

Original commit changeset: f2ee5148fa7d

fbshipit-source-id: b1cac6ad463a83afb7eee83c6a9d575abf15072f
2020-06-09 07:28:52 -07:00
9733390998 Add torch.flip{lr, ud} (#38599)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349

TODO:
* [x] Add Tests
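
For reference, a minimal usage sketch of the new functions:

```python
import torch

x = torch.arange(6).reshape(2, 3)
torch.fliplr(x)  # reverses columns (left/right), like np.fliplr
torch.flipud(x)  # reverses rows (up/down), like np.flipud
```
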
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38599

Differential Revision: D21941884

Pulled By: mruberry

fbshipit-source-id: 7a442ff11051c2c868cf8e3c04e4bba0f1a1d426
2020-06-09 07:19:37 -07:00
4ec86ca5ba [iOS] Disable depthwise3x3_winograd on iOS (#39591)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39591

Test Plan: Imported from OSS

Differential Revision: D21908298

Pulled By: xta0

fbshipit-source-id: 2b67909f34ca1d3ff0bed9ff6875e0ba2ae3b98e
2020-06-09 02:55:30 -07:00
7d85e77076 Use atomic operations to manipulate current RPC agent (#39663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39663

I was investigating a memory corruption issue and thought it may be due to a race condition in (un)setting the current RPC agent. It turns out it wasn't (still investigating...). I had already written this fix, and it is a real fix (there could really be a race condition), so I'm sending it out to see whether there's interest in merging it. I believe its practical usefulness is however very limited, since typically the current RPC agent is only changed twice (at start and at shutdown) and thus there's limited risk for races.

As there may be some confusion on atomicity of shared_ptrs, let me clarify a few things from the get go. Operations on the control blocks of shared_ptrs (i.e., increasing and decreasing the refcounts) are atomic, which means that it is safe to manipulate *two different* shared_ptrs that point to the *same* object from *different* threads. However, the shared_ptr object itself is not atomic, which means that it is *not* safe to manipulate the *same* shared_ptr from two *different* threads. For that reason, the STL provides atomic functions explicitly specialized for shared_ptrs: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic (in C++ 20, they are being replaced by a specialization of std::atomic<std::shared_ptr<T>>). Note that this has been called "the worst question of all of C++" by Louis Brandy at his CppCon talk: https://youtu.be/lkgszkPnV8g?t=1210
ghstack-source-id: 105475005

Test Plan: Unit tests

Differential Revision: D21932817

fbshipit-source-id: da33fedd98efb820f284583ce7ff1c1c531dea9c
2020-06-09 02:11:15 -07:00
af05158c56 torch.multinomial : fast-path for replacement=False (#39636)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39624 #11931

Based on the example by RobertoLat
https://github.com/pytorch/pytorch/issues/11931#issuecomment-625882503

**Fast-path is not taken on CPU for `Half` as `log` doesn't support it.**
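
The fast path builds on the exponential-race (Gumbel top-k) idea from the linked comment; a minimal Python sketch of that idea (not the exact kernel code):

```python
import torch

def multinomial_no_replacement(p, num_samples):
    # Sampling without replacement proportional to p is equivalent to
    # taking the top-k of p / q with q ~ Exp(1), avoiding the slow
    # sequential algorithm.
    q = torch.empty_like(p).exponential_(1)
    return torch.topk(p / q, num_samples).indices
```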

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import time
import torch
import numpy as np

for n, t in [(500_000, 10),
             (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)

print('****' * 10)

for n, t in [(50_000, 100),
             (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```

Before:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```

After:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.6487455368042
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.0663309097290039
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.09588909149169922
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.60748076438904
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13187885284423828
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17609834671020508
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.007131099700927734
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.022255420684814453
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.0323028564453125
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.04995012283325195
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.04948878288269043
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.05495333671569824
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39636

Differential Revision: D21925406

Pulled By: ngimel

fbshipit-source-id: f2ee5148fa7dd88e018c461ced0e2361c3a43796
2020-06-08 23:52:51 -07:00
338a1ccce5 Fix error handling for rpc.remote (#39605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39605

1. `RRef.to_here()` could serialize a Python object into a message.
However, we did not catch the Python pickle error, which would
result in crash. This was exposed when calling `rpc.remote` with
a user function that returns `torch.futures.Future`.
2. `rpc.function.async_execution` could throw an error on the server.
This commit sets the error on the OwnerRRef properly.
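
A hypothetical repro sketch of case 1 (worker name and setup are assumptions):

```python
import torch
import torch.distributed.rpc as rpc

def returns_future():
    # torch.futures.Future is not pickleable; returning it from an RPC
    # used to crash the server during serialization.
    return torch.futures.Future()

# Assuming RPC is initialized and a peer named "worker1" exists:
# rref = rpc.remote("worker1", returns_future)
# rref.to_here()  # now surfaces a pickle error instead of crashing
```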

Test Plan: Imported from OSS

Differential Revision: D21913820

Pulled By: mrshenli

fbshipit-source-id: 50b620641a3b89d310b3b907570561decd83ee34
2020-06-08 22:57:24 -07:00
aa5ccf9626 Kill dead pairwise ops in THC (#39070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39070

Differential Revision: D21947797

Pulled By: ngimel

fbshipit-source-id: a556a1b61b814b90284a6de3a7a2c3d0793bb908
2020-06-08 22:40:21 -07:00
1790d35848 Skip test_minmax_illegal_dtype on XLA (#39693)
Summary:
It's better to have skipping logic explicitly defined in test decorators rather than in some hard-to-find blacklists
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39693

Differential Revision: D21947893

Pulled By: malfet

fbshipit-source-id: 3d0855eda7e10746ead80fccf84a8db8bf5a3ef1
2020-06-08 22:34:44 -07:00
0251ba6108 Fix ONNX export of RNNs with no bias (#36894)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34800.

Currently, the LSTM/RNN/GRU export to ONNX can't handle models without a bias term.
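
A minimal repro sketch (file name illustrative):

```python
import torch

# Exporting a bias-free recurrent module previously failed because the
# exporter assumed the bias tensors always exist.
model = torch.nn.LSTM(10, 20, num_layers=1, bias=False)
x = torch.randn(5, 3, 10)
torch.onnx.export(model, (x,), "lstm_no_bias.onnx")
```
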
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36894

Reviewed By: hl475

Differential Revision: D21134794

Pulled By: houseroad

fbshipit-source-id: e71e089025a3dc7e8c883ff99cd788c5f302492e
2020-06-08 20:36:22 -07:00
9f71997380 some refactor on register_distributed_ops (#38657)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38657

Test Plan: Imported from OSS

Differential Revision: D21940442

Pulled By: wanchaol

fbshipit-source-id: c60c1ac3ede355c276e0d03fb13ff301698f6acd
2020-06-08 19:43:46 -07:00
f32c9eb579 [jit] register distributed backward (#38494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38494

This registers distributed.autograd.backward with the JIT.

Test Plan: Imported from OSS

Differential Revision: D21596133

Pulled By: wanchaol

fbshipit-source-id: b64343010616a636304de54ae74ad4fb83445a62
2020-06-08 19:43:40 -07:00
d493918436 [dist_autograd] expose distributed backward C++ API (#38656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38656

Test Plan: Imported from OSS

Differential Revision: D21940441

Pulled By: wanchaol

fbshipit-source-id: e9d35201825912f5e7d7e1d0a71586abe5a6f71c
2020-06-08 19:42:21 -07:00
e033db0477 Enable RRef timeout for tensorpipe (#39531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39531

Enables RRef timeout support in the TP agent by having it mark timeout
errors with the `makeRPCError` API. Also does some refactoring so the TP
agent can print out the timeout for each future that has timed out.
ghstack-source-id: 105461555

Test Plan: CI

Differential Revision: D21881475

fbshipit-source-id: f63300e1f0a80ac7eebc983752070c0ec6ac17a6
2020-06-08 19:08:49 -07:00
afb2d27b24 Migrate AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES to c10::complex (#39296)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39296

Differential Revision: D21825114

Pulled By: anjali411

fbshipit-source-id: 9dd4719282591d635a64001d27a649c86fb5022c
2020-06-08 18:54:15 -07:00
d1cdf1fd56 update convert_sync_batchnorm docs (#39646)
Summary:
fix some inaccuracies
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39646

Differential Revision: D21930023

Pulled By: mrshenli

fbshipit-source-id: 9c6b8eeefeb0482a6ae7f825ae055090ce589223
2020-06-08 18:42:42 -07:00
1f7557d173 Migrate diag and trace from TH to ATen (CUDA) (#36876)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24549 #24649

## diag

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import time
import torch
import timeit
import math

for n, t in [(100, 20000),
             (400, 20000)]:
    for dtype in (torch.int8,torch.int16, torch.int32, torch.int64, torch.float, torch.double):
        # Input Setup
        a = torch.arange(n, dtype=dtype, device="cuda")
        b = a.reshape((int(math.sqrt(n)), int(math.sqrt(n))))
        print(f'torch.diag a.numel() == {n} for {t} times {dtype}')
        for inp, inp_name in [(a, '1-D'), (b, '2-D')]:
            start = time.time()
            torch.cuda.synchronize()
            # Iterate
            for _ in range(t):
                torch.diag(inp)
            # Final Synchronize Before Teardown
            torch.cuda.synchronize()
            print(inp_name + " Took:", time.time() - start)
```

|Dtype | Before | After |
|------|--------|-------|
| int8-Elems:100 | 1-D Took: 0.20730137825012207<br />2-D Took: 0.12553787231445312<br /> | 1-D Took: 0.33618664741516113<br />2-D Took: 0.1264970302581787<br /> |
| int16-Elems:100 | 1-D Took: 0.2127547264099121<br />2-D Took: 0.12582707405090332<br /> | 1-D Took: 0.2146449089050293<br />2-D Took: 0.12558245658874512<br /> |
| int32-Elems:100 | 1-D Took: 0.2106609344482422<br />2-D Took: 0.12958312034606934<br /> | 1-D Took: 0.2121574878692627<br />2-D Took: 0.1264948844909668<br /> |
| int64-Elems:100 | 1-D Took: 0.20768976211547852<br />2-D Took: 0.1256253719329834<br /> | 1-D Took: 0.2077159881591797<br />2-D Took: 0.12476921081542969<br /> |
| float32-Elems:100 | 1-D Took: 0.2137584686279297<br />2-D Took: 0.12708187103271484<br /> | 1-D Took: 0.21565628051757812<br />2-D Took: 0.1275336742401123<br /> |
| float64-Elems:100 | 1-D Took: 0.21710658073425293<br />2-D Took: 0.12845087051391602<br /> | 1-D Took: 0.219193696975708<br />2-D Took: 0.1264345645904541<br /> |
| int8-Elems:400 | 1-D Took: 0.20585918426513672<br />2-D Took: 0.1257162094116211<br /> | 1-D Took: 0.20970797538757324<br />2-D Took: 0.12455391883850098<br /> |
| int16-Elems:400 | 1-D Took: 0.20943427085876465<br />2-D Took: 0.12425971031188965<br /> | 1-D Took: 0.21483230590820312<br />2-D Took: 0.12662172317504883<br /> |
| int32-Elems:400 | 1-D Took: 0.21058869361877441<br />2-D Took: 0.1312875747680664<br /> | 1-D Took: 0.2092602252960205<br />2-D Took: 0.12785696983337402<br /> |
| int64-Elems:400 | 1-D Took: 0.287722110748291<br />2-D Took: 0.12862586975097656<br /> | 1-D Took: 0.28710484504699707<br />2-D Took: 0.12852025032043457<br /> |
| float32-Elems:400 | 1-D Took: 0.21535277366638184<br />2-D Took: 0.1278238296508789<br /> | 1-D Took: 0.2140669822692871<br />2-D Took: 0.1268482208251953<br /> |
| float64-Elems:400 | 1-D Took: 0.28638601303100586<br />2-D Took: 0.13219022750854492<br /> | 1-D Took: 0.28608059883117676<br />2-D Took: 0.13063836097717285<br /> |

## trace

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import time
import torch
import timeit
import math

for n, t in [(10000, 20000),
             (40000, 20000)]:
    for dtype in (torch.int8,torch.int16, torch.int32, torch.int64, torch.float, torch.double):
        # Input Setup
        a = torch.arange(n, dtype=dtype, device="cuda")
        a = a.reshape((int(math.sqrt(n)), int(math.sqrt(n))))
        print(f'torch.trace a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.trace(a)
        # Final Synchronize Before Teardown
        torch.cuda.synchronize()
        print("Took:", time.time() - start)
```

|Dtype | Before | After |
|------|--------|-------|
| int8-Elems:10000 | Took: 0.4376576900482178<br /> | Took: 0.42725276947021484<br /> |
| int16-Elems:10000 | Took: 0.4334981441497803<br /> | Took: 0.4376239776611328<br /> |
| int32-Elems:10000 | Took: 0.43313121795654297<br /> | Took: 0.43097853660583496<br /> |
| int64-Elems:10000 | Took: 0.28386616706848145<br /> | Took: 0.2827033996582031<br /> |
| float32-Elems:10000 | Took: 0.2905247211456299<br /> | Took: 0.2914285659790039<br /> |
| float64-Elems:10000 | Took: 0.29450368881225586<br /> | Took: 0.2907843589782715<br /> |
| int8-Elems:40000 | Took: 0.4255516529083252<br /> | Took: 0.41020774841308594<br /> |
| int16-Elems:40000 | Took: 0.4287736415863037<br /> | Took: 0.42923426628112793<br /> |
| int32-Elems:40000 | Took: 0.43021249771118164<br /> | Took: 0.42778849601745605<br /> |
| int64-Elems:40000 | Took: 0.2852292060852051<br /> | Took: 0.28212475776672363<br /> |
| float32-Elems:40000 | Took: 0.29549574851989746<br /> | Took: 0.29524707794189453<br /> |
| float64-Elems:40000 | Took: 0.29451632499694824<br /> | Took: 0.2894322872161865<br /> |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36876

Differential Revision: D21940588

Pulled By: ngimel

fbshipit-source-id: f0ec59b1d16a51690390a002b7c46eec93f0b092
2020-06-08 18:27:18 -07:00
64192ca3da Skip unit tests relying on MKL if compiled without it (#39672)
Summary:
Also skip TestTorchDeviceTypeCPU.test_float_to_int_conversion_finite_cpu_uint8 on PowerPC.
See an example of the test failures at https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/1099/console.
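
A sketch of the guard pattern (class and test names illustrative):

```python
import unittest
import torch

class TestMKLOps(unittest.TestCase):
    @unittest.skipIf(not torch.backends.mkl.is_available(),
                     "PyTorch was compiled without MKL")
    def test_mkl_backed_op(self):
        torch.rfft(torch.randn(8), 1)  # FFT is MKL-backed on CPU
```
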
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39672

Differential Revision: D21943588

Pulled By: malfet

fbshipit-source-id: 3da0d33597db5aa8728e682b8e27dd5f7f6765f4
2020-06-08 17:52:00 -07:00
8004d35979 Remove tuple from reduction (#39433)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39433

Differential Revision: D21940652

Pulled By: ngimel

fbshipit-source-id: fca084fdf789bd2ea765cc383ae394bf94c1510b
2020-06-08 17:24:40 -07:00
9551fb22d6 [quant][graphmode] Preserve numerics in debug option for clamp ops (#39219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39219

We don't model clamp ops correctly right now; this PR fixes that.

The reason is that the quantized clamp op quantizes the scalar arguments inside the op implementation: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp#L614-L617

So we'll need to model this explicitly in the IR.
When we see an `aten::dequantize - aten::clamp(%x, %min, %max)` pattern,
we first make a scalar tensor with `aten::scalar_tensor(%scalar, ...)`, then quantize that tensor with the same quantization parameters as the input tensor of the `aten::clamp`, dequantize it, and finally convert the dequantized tensor back to a scalar using `aten::item`.
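
A rough numerics sketch in Python of what the rewritten IR computes (values and qparams are illustrative):

```python
import torch

x = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=0,
                              dtype=torch.quint8)
# Model the scalar bound the same way the quantized op does: quantize
# it with x's qparams, dequantize, and turn it back into a scalar.
min_q = torch.quantize_per_tensor(torch.tensor([0.25]), 0.1, 0,
                                  torch.quint8)
min_scalar = min_q.dequantize().item()
y = torch.clamp(x.dequantize(), min=min_scalar)
```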

Test Plan: Imported from OSS

Differential Revision: D21831350

fbshipit-source-id: d60731459a0465d64946aabc62065d25d92faefc
2020-06-08 17:15:39 -07:00
dd5aa1fb22 Cleanup unused args in max_unpooling3d (#39664)
Summary:
dlibenzi reported (thanks!) that these arguments are not used in the implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39664

Differential Revision: D21934989

Pulled By: ailzhang

fbshipit-source-id: 35e79ce7f49626c8ad79362f972e442c06022dcc
2020-06-08 16:35:52 -07:00
b7b7433561 setup: Add long description to wheel packages (#39676)
Summary:
Closes out https://github.com/pytorch/pytorch/issues/38354

For reference: https://packaging.python.org/guides/making-a-pypi-friendly-readme/

Should fill out the PyPI description as well.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39676

Reviewed By: malfet

Differential Revision: D21940656

Pulled By: seemethere

fbshipit-source-id: 6c39500404227047d8f24936db0697fe44a6b9e8
2020-06-08 16:25:39 -07:00
84d8d68397 .circleci: Fold postnightly workfow into nightly (#39669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39669

Folds the postnightly workflow, including html updating jobs and binary
size jobs, into the regular nightly workflow that should only run after
all upload jobs have completed.

This also moves the smoke testing jobs into the binary_builds workflow.

Do note that the devtoolset7 html update job has been removed since we
do not upload binaries specifically to that location anymore.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21936811

Pulled By: seemethere

fbshipit-source-id: a062413b69bafe0a85173020e8b218b375124106
2020-06-08 16:11:59 -07:00
0147216a46 [TensorPipe Agent] Documentation fixes and nits (#39467)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39467

Mainly cleaned up some docs and spelling/grammar nits
ghstack-source-id: 105457210

Test Plan: Sandcastle/CI

Differential Revision: D21755354

fbshipit-source-id: 28e7b925ace7813548a1bf8cdcf96cd423a227aa
2020-06-08 15:16:41 -07:00
bba30d1bd8 Add undefined tensor gradient support to all backward functions (#39400)
Summary:
Adds the ability for all backward functions to accept undefined output gradient arguments. An undefined gradient is a Tensor that was created by the argumentless constructor `at::Tensor()`, where `tensor.defined() == false`.

Also adds new autograd nodes, UndefinedGrad and UndefinedGradBackward, that can be used from within Python code to inject undefined gradients into a backward function. A new test case is added to the backward function unit tests to use the UndefinedGrad node to ensure that undefined gradients do not break any backward functions.
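
In Python, an undefined gradient surfaces as None in a custom Function's backward; a minimal sketch:

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2, x * 3

    @staticmethod
    def backward(ctx, g1, g2):
        # If only the first output feeds the loss, g2 arrives as an
        # undefined tensor, which Python sees as None.
        if g2 is None:
            g2 = torch.zeros_like(g1)
        return g1 * 2 + g2 * 3

x = torch.randn(3, requires_grad=True)
a, b = TwoOutputs.apply(x)
a.sum().backward()  # b is unused, so backward receives g2 = None
```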

Closes https://github.com/pytorch/pytorch/issues/33138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39400

Differential Revision: D21936588

Pulled By: albanD

fbshipit-source-id: eccc5f55c77babe6dadcea4249d0c68a3c64e85d
2020-06-08 14:13:53 -07:00
8251f1872f .circleci: Move ecr gc build job to ecr gc workflow (#38523)
Summary:
It didn't really make sense for it to be where it was, and since the
build only takes about 5 minutes, it's best to just move it into the
garbage collection workflow.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38523

Reviewed By: malfet

Differential Revision: D21937332

Pulled By: seemethere

fbshipit-source-id: 6b797a6af88549dbd5ccce88814a1428354ce7f2
2020-06-08 13:13:30 -07:00
83dd56632e Fast tanh for the LLVM backend. (#39528)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39528

Test Plan: Imported from OSS

Differential Revision: D21927791

Pulled By: zheng-xq

fbshipit-source-id: a3f7d79bf0d3a399000ffd7ff4d0502ba365f1dc
2020-06-08 13:02:16 -07:00
df2d19723a c10/util/complex_math.h and c10/util/complex_utils.h should not be individually included (#39276)
Summary:
Add a compilation error if they are individually included. Devs should
instead include c10/util/complex_type.h (which includes these two files).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39276

Differential Revision: D21924922

Pulled By: ezyang

fbshipit-source-id: ad1034be5d9d694b18cc5f03a44f540f10de568c
2020-06-08 11:52:18 -07:00
397b24bb37 Cleanup rref_impl (#39530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39530

Some cleanups for consistency and code reuse. Uses TORCH_CHECK instead
of explicitly throwing a runtime error. Calls RRefContext::handleError() for
the default error handler fallback.
ghstack-source-id: 105424164

Test Plan: CI

Differential Revision: D21881244

fbshipit-source-id: c706244869e5ddb915f9d8e4f81d1365b4b57321
2020-06-08 11:47:09 -07:00
a2125135ee [predictor] move fblearner/predictor to platform009
Summary: This is to test predictor on platform009

Test Plan:
```
fbpkg build -E  fblearner/predictor

fbpkg build -E  fblearner/predictor_proxy
```

# Performance test
## ServiceLab experiments

https://fburl.com/servicelab/p2xo4c85

## Perf A/B test

perf_b is platform-009

https://fburl.com/ods/59kdhdf9

perf_a is platform-09

https://fburl.com/ods/gjctzpe3

Differential Revision: D20552379

fbshipit-source-id: d6d9094aedfb2c1db623d44108627e8e00dde47e
2020-06-08 11:31:33 -07:00
ab6c447f59 [ROCm] Enable AMP autocast tests on ROCm (#39616)
Summary:
Enables AMP autocast tests on ROCm.

ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39616

Differential Revision: D21924219

Pulled By: ezyang

fbshipit-source-id: f4df4ad32cd8fae8c4620cd8ab18b00d74fb46bd
2020-06-08 10:30:39 -07:00
cc2f7fa502 Revert D21930435: Revert D17923732: Optimize GroupNorm on CUDA
Test Plan: revert-hammer

Differential Revision:
D21930435

Original commit changeset: 53bd5db7d61e

fbshipit-source-id: 5418393b9207a387b0f448477250354cbc50fdb9
2020-06-08 08:46:55 -07:00
e4f9c74db3 add dtype checks for scatter/gather family of functions. (#38646)
Summary:
Adds additional dtype checks for the scatter/gather family of functions, namely:
1. Checks whether `index` is of type `Long`
2. Checks whether `src.dtype == self.dtype`.
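
A minimal sketch of what these checks now reject:

```python
import torch

x = torch.zeros(3)
idx = torch.tensor([0, 1], dtype=torch.int32)   # must be int64 (Long)
src = torch.tensor([1, 2], dtype=torch.int64)   # must match x's dtype

# Both calls now raise a clear RuntimeError instead of silently
# misbehaving:
# x.scatter_(0, idx, src.float())   # index must be Long, not Int
# x.scatter_(0, idx.long(), src)    # src dtype must match self dtype
```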

Fixes [https://github.com/pytorch/pytorch/issues/38554](https://github.com/pytorch/pytorch/issues/38554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38646

Differential Revision: D21883033

Pulled By: gchanan

fbshipit-source-id: 4bbd48ec0706ddb002318742edba640871ec0162
2020-06-08 08:42:00 -07:00
e3e8f24cbe Remove duplicate 'with_gil' declaration. (#39540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39540

This gets picked up by mypy as an error in 1.5.1; I'm not sure if it's due to a different version or setting, but we might as well fix it.

Test Plan: Imported from OSS

Differential Revision: D21891772

Pulled By: gchanan

fbshipit-source-id: 6f95bcd0652007323cd0c79070425b64e0b71c55
2020-06-08 08:34:38 -07:00
a83f7a1d70 Revert D17923732: Optimize GroupNorm on CUDA
Test Plan: revert-hammer

Differential Revision:
D17923732

Original commit changeset: 9afaf01288bd

fbshipit-source-id: 53bd5db7d61e5eda8d7953d7f6321e54321d7ac2
2020-06-08 08:14:12 -07:00
e41fe60867 Add error message when negative stride is passed to as_strided (#39508)
Summary:
Fixes this issue https://github.com/pytorch/pytorch/issues/33290.
Builds upon this PR https://github.com/pytorch/pytorch/pull/33392.
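
A minimal sketch of the newly rejected call (exact message wording may differ):

```python
import torch

x = torch.randn(3)
# Negative strides now produce an explicit error rather than
# undefined behavior:
# x.as_strided((3,), (-1,))  # RuntimeError: negative strides unsupported
```
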
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39508

Differential Revision: D21890557

Pulled By: zou3519

fbshipit-source-id: 8e1a9afb064a6e19551bf3ede3103dd3f023c660
2020-06-08 07:45:24 -07:00
820e81ba09 add overload name for min/max with list input (#39614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39614

Add overload names to differentiate between:

prim::min.int(int a, int b) -> (int)
prim::min.int(int[] l, int[] r) -> (int[])

Test Plan:
verified op names for aten::min and aten::max are different

before
```
prim::min.int(int a, int b) -> (int)
prim::min.float(float a, float b) -> (float)
prim::min.int_float(int a, float b) -> (float)
prim::min.float_int(float a, int b) -> (float)
prim::min(Scalar a, Scalar b) -> (Scalar)
prim::max.int(int a, int b) -> (int)
prim::max.float(float a, float b) -> (float)
prim::max.int_float(int a, float b) -> (float)
prim::max.float_int(float a, int b) -> (float)
prim::max(Scalar a, Scalar b) -> (Scalar)

prim::min.int(int[] l, int[] r) -> (int[])
prim::max.int(int[] l, int[] r) -> (int[])
prim::min.self_int(int[] self) -> (int)
prim::max.self_int(int[] self) -> (int)
prim::min.float(float[] l, float[] r) -> (float[])
prim::max.float(float[] l, float[] r) -> (float[])
prim::min.self_float(float[] self) -> (float)
prim::max.self_float(float[] self) -> (float)
prim::min.bool(bool[] l, bool[] r) -> (bool[])
prim::max.bool(bool[] l, bool[] r) -> (bool[])
prim::min.self_bool(bool[] self) -> (bool)
prim::max.self_bool(bool[] self) -> (bool)
```

after
```
prim::min.int(int a, int b) -> (int)
prim::min.float(float a, float b) -> (float)
prim::min.int_float(int a, float b) -> (float)
prim::min.float_int(float a, int b) -> (float)
prim::min(Scalar a, Scalar b) -> (Scalar)
prim::max.int(int a, int b) -> (int)
prim::max.float(float a, float b) -> (float)
prim::max.int_float(int a, float b) -> (float)
prim::max.float_int(float a, int b) -> (float)
prim::max(Scalar a, Scalar b) -> (Scalar)

prim::min.int_list(int[] l, int[] r) -> (int[])
prim::max.int_list(int[] l, int[] r) -> (int[])
prim::min.self_int(int[] self) -> (int)
prim::max.self_int(int[] self) -> (int)
prim::min.float_list(float[] l, float[] r) -> (float[])
prim::max.float_list(float[] l, float[] r) -> (float[])
prim::min.self_float(float[] self) -> (float)
prim::max.self_float(float[] self) -> (float)
prim::min.bool_list(bool[] l, bool[] r) -> (bool[])
prim::max.bool_list(bool[] l, bool[] r) -> (bool[])
prim::min.self_bool(bool[] self) -> (bool)
prim::max.self_bool(bool[] self) -> (bool)
```

Reviewed By: iseeyuan

Differential Revision: D21914844

fbshipit-source-id: f1792a8c3b3ed6d1a4ba9705c4504f15e3665126
2020-06-08 06:13:10 -07:00
b83fed8d4c [futures] Add c++ ivalue::Future collectAll() helper (#39119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39119

Add some base c++ unittest coverage for ivalue::Future, and in
the process, add a basic collectAll() primitive, per 38937.

In the process, I realized that List<Future> is effectively
impossible to construct (since the Future's type is not templated
but rather passed in, getTypePtr_<T>::call() isn't defined),
so I added a workaround in List to make it possible.
ghstack-source-id: 105309650

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit/...

Differential Revision: D21756884

fbshipit-source-id: 5d40c8d1c55098de5497655c7b887f4f56508a37
2020-06-08 05:52:09 -07:00
172f31171a [quant] QNNPACK deconv kernel and tests (#36790)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36790

Test Plan: Imported from OSS

Differential Revision: D21110111

fbshipit-source-id: 548df3a9853ad33d21d279393b91d1691050d4c4
2020-06-08 00:31:25 -07:00
6c56671fd9 [jit] avoid pre-convert tensor to cpu in pickling (#38898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38898

Pickling will pickle the tensor meta info, and it's up to the JIT
exporter or other upstream consumers of the pickler to decide how to
write the actual tensor data.

This PR makes us call getWritableTensorData at the upper level so that
RPC and TensorPipe can leverage it, pickling only the tensor metadata
without converting the tensor from GPU to CPU.

Test Plan: Imported from OSS

Differential Revision: D21879866

Pulled By: wanchaol

fbshipit-source-id: 75f7ff4073e4ad15b6588973dcbdc48f97a8329f
2020-06-07 21:28:33 -07:00
1db4a31d92 [quant] QNNPACK deconvolution packing (#37405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37405

Test Plan: Imported from OSS

Differential Revision: D21301246

fbshipit-source-id: be72e777a211d414d40e2912dbc2e0ec640c6b32
2020-06-07 20:49:06 -07:00
ee2bc13f44 Fix smoke test jobs (#39638)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39626.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39638

Differential Revision: D21924224

Pulled By: ezyang

fbshipit-source-id: 8da75e401bfbff5e11ceeccefd77d0fad81356e4
2020-06-07 17:05:44 -07:00
b06b792bbd remove double registered ops (#39609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39609

These two ops are registered twice in the same file

duplicated op: aten::_infer_size(int[] a, int[] b) -> (int[])
duplicated op: aten::_no_grad_embedding_renorm_(Tensor weight, Tensor input, float max_norm, float norm_type) -> (Tensor)

Test Plan: compile

Reviewed By: iseeyuan

Differential Revision: D21915104

fbshipit-source-id: e0147c76e3c84c02952927a7e158ccb92449640c
2020-06-07 16:25:29 -07:00
8177637374 remove duplicated op schema for aten::pow (#39606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39606

Removed duplicated schema for aten::pow

Test Plan:
Previously there were many duplicated aten::pow schemas:

```
aten::pow.int(int a, int b) -> (float)
aten::pow.float(float a, float b) -> (float)
aten::pow.int_float(int a, float b) -> (float)
aten::pow.float_int(float a, int b) -> (float)
aten::pow(Scalar a, Scalar b) -> (float)
aten::pow.int(int a, int b) -> (int)  // duplicated name!
aten::pow.float(float a, float b) -> (float) // duplicated schema!
aten::pow.int_float(int a, float b) -> (float) // duplicated schema!
aten::pow.float_int(float a, int b) -> (float) // duplicated schema!
aten::pow(Scalar a, Scalar b) -> (Scalar) // duplicated name!
```

After this diff, there are only 7 ops with different overload name:
```
aten::pow.int(int a, int b) -> (float)
aten::pow.float(float a, float b) -> (float)
aten::pow.int_float(int a, float b) -> (float)
aten::pow.float_int(float a, int b) -> (float)
aten::pow(Scalar a, Scalar b) -> (float)
aten::pow.Scalar(Scalar a, Scalar b) -> (Scalar)
aten::pow.int_to_int(int a, int b) -> (int)
```

Reviewed By: iseeyuan

Differential Revision: D21914441

fbshipit-source-id: 1e82c83c77d22206046276bbb52a65088c58ed33
2020-06-07 16:17:34 -07:00
614dd03272 Optimize GroupNorm on CUDA (#28204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28204

Optimize GroupNorm on CUDA
ghstack-source-id: 105388365

Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm"

Reviewed By: houseroad

Differential Revision: D17923732

fbshipit-source-id: 9afaf01288bd9d273eed89909bff77243df89e9f
2020-06-07 14:34:01 -07:00
ebdff07d49 instancenorm: static quant graph mode support (#39096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39096

Hooks up instancenorm for graph mode static quant

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_instance_norm
```

Imported from OSS

Differential Revision: D21885258

fbshipit-source-id: 650cc5b162dda044866176fea6c345082d9788ed
2020-06-07 13:38:28 -07:00
b443ca26c5 groupnorm: graph mode static quant support (#39095)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39095

Hooks up groupnorm to graph mode static quant

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_group_norm
```

Imported from OSS

Differential Revision: D21885257

fbshipit-source-id: 3415c4de76181b026d2f5bfebab130fea29e1d1e
2020-06-07 13:38:22 -07:00
952deba828 layernorm: eager mode qat support (#39094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39094

Adds eager mode QAT handling for LayerNorm

Test Plan:
```
python test/test_quantization.py TestQuantizationAwareTraining.test_normalization
```

Imported from OSS

Differential Revision: D21885260

fbshipit-source-id: 4f4c84a8bb8ba15dd78494f92569ed3a30d89169
2020-06-07 13:38:16 -07:00
b530176d10 instancenorm: eager mode QAT support (#39093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39093

Adds eager mode QAT support for instancenorm

Test Plan:
```
python test/test_quantization.py TestQuantizationAwareTraining.test_normalization
```

Imported from OSS

Differential Revision: D21885264

fbshipit-source-id: 7753995eed895bad26f713a857c6b0d194ea99d9
2020-06-07 13:38:10 -07:00
202625ba9e groupnorm: eager mode QAT support (#39092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39092

Adds eager mode QAT support for GroupNorm.

Test Plan:
```
python test/test_quantization.py TestQuantizationAwareTraining.test_normalization
```

Imported from OSS

Differential Revision: D21885261

fbshipit-source-id: 0352e6a830e6384e7ad747067f8bf8ad64ab7fa8
2020-06-07 13:38:05 -07:00
2140874228 instancenorm: eager static quant support (#39091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39091

Adds eager mode static quant support for instancenorm.
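
A hedged end-to-end sketch of the eager-mode flow (qconfig choice and shapes are assumptions):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.norm = torch.nn.InstanceNorm2d(4, affine=True)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.norm(self.quant(x)))

m = M()
m.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(m, inplace=True)
m(torch.randn(1, 4, 8, 8))                   # calibrate observers
torch.quantization.convert(m, inplace=True)  # swap in quantized ops
```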

Test Plan:
```
python test/test_quantization.py TestPostTrainingStatic.test_normalization
python test/test_quantization.py TestStaticQuantizedModule.test_instance_norm
```

Imported from OSS

Differential Revision: D21885265

fbshipit-source-id: 277506faf108f3561867cd8449a2390b7a44c462
2020-06-07 13:37:59 -07:00
f9b675f7b6 groupnorm: eager static quant support (#39090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39090

Makes quantized GroupNorm work in eager mode post training static quant.

Test Plan:
```
python test/test_quantization.py TestPostTrainingStatic.test_normalization
python test/test_quantization.py TestStaticQuantizedModule.test_group_norm
```

Imported from OSS

Differential Revision: D21885262

fbshipit-source-id: 58b0ffb59c601fcb4c79f711c7c98a667ffc6170
2020-06-07 13:37:53 -07:00
26bc272793 quant: clean up normalization channels_last handling (#37802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37802

* adds test coverage for channels_last input format for quantized normalization ops
* fixes quantized group_norm and instance_norm to always return contiguous tensors

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_group_norm
python test/test_quantization.py TestQuantizedOps.test_qlayer_norm
python test/test_quantization.py TestQuantizedOps.test_instance_norm
```

Imported from OSS

Differential Revision: D21395196

fbshipit-source-id: df55e842fe93ae594a336f1b115faea9ba3c88c1
2020-06-07 13:35:49 -07:00
8a4597b808 [quant][graphmode] Dynamic quant: Insert observers for module output (#39458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39458

Previously, if we had a CallMethod followed by a CallFunction, we didn't check for observers at the output of the CallMethod, since it was handled separately.
This change makes checking the outputs of all nodes the default, so that all values that need observers are identified.

Test Plan:
python test/test_quantization.py test_dynamic_shared_weights

Imported from OSS

Differential Revision: D21872939

fbshipit-source-id: 08dd8b7ddf73ef2cc26ebcf4ceb2f222c4559ab3
2020-06-07 11:11:23 -07:00
67115b226a [quant][graphmode] Dynamic Quant Do not depend on input shapes (#39412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39412

This PR introduces changes to enable running the weight observer standalone in the graph.
It extracts the nodes from the graph that correspond to the observed weight value and adds all the related nodes to a new subgraph.
The subgraph is then executed using GraphFunction.

Test Plan:
python test/test_quantization.py TestGraphModePostTrainingStatic
python test/test_quantization.py TestQuantizeDynamicScript

Imported from OSS

Differential Revision: D21872940

fbshipit-source-id: 55f1dcc2caef193531e2b807c8e56288b9794520
2020-06-07 11:09:44 -07:00
6d13b583a7 [quant][graphmode] Support conv*d_relu in traced models (#39490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39490

Test Plan: Imported from OSS

Differential Revision: D21917117

fbshipit-source-id: c96633aaaa347529cc1ca6ca1c982cfb04675ccf
2020-06-07 07:36:20 -07:00
faf0a3bd7a Move bernoulli_() to DistributionTemplates (#38558)
Summary:
resolve the feature introduced in https://github.com/pytorch/pytorch/issues/37373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38558

Differential Revision: D21920685

Pulled By: pbelevich

fbshipit-source-id: 50c77d9aaa334b3276a2352afe6c4ad03f12be31
2020-06-07 07:18:30 -07:00
a25b1b918b Fix __STDC_FORMAT_MACROS redefinition issue for TypeDerived (#39608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39608

As title. When adding a new build mode, TypeDerived failed to compile due to macro redefinition. A conditional define fixes this issue.

Test Plan: Tests pass.

Reviewed By: iseeyuan

Differential Revision: D21914975

fbshipit-source-id: 12e04af29b7510106e8e47fa48e30b829aeff467
2020-06-07 00:45:54 -07:00
183b04da3e [pytorch] remove tracing logic from gen_variable_factories.py (#39514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39514

For methods in `variable_factories.h`, we set the `AutoNonVariableTypeMode` guard before dispatching, which also disables tracing as a side effect, so we had to replicate the tracing logic there.
Now that we have created a separate TraceType, we can remove tracing from `variable_factories.h`.

Example of old code:
```
inline at::Tensor arange(at::Scalar start, at::Scalar end, at::Scalar step, const at::TensorOptions & options = {}) {
  #if !defined(PYTORCH_DISABLE_TRACING)
  torch::jit::Node* node = nullptr;
  std::shared_ptr<jit::tracer::TracingState> tracer_state;
  if (jit::tracer::isTracing()) {
    tracer_state = jit::tracer::getTracingState();
    at::Symbol op_name;
    op_name = jit::Symbol::fromQualString("aten::arange");
    node = tracer_state->graph->create(op_name, /*num_outputs=*/0);
    jit::tracer::recordSourceLocation(node);
    jit::tracer::addInputs(node, "start", start);
    jit::tracer::addInputs(node, "end", end);
    jit::tracer::addInputs(node, "step", step);
    jit::tracer::addInputs(node, "options", options);
    tracer_state->graph->insertNode(node);

    jit::tracer::setTracingState(nullptr);
  }
  #endif
  at::Tensor tensor = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::arange(start, end, step, at::TensorOptions(options));
  })();
  at::Tensor result =
    autograd::make_variable(std::move(tensor), /*requires_grad=*/options.requires_grad());
  #if !defined(PYTORCH_DISABLE_TRACING)
  if (tracer_state) {
    jit::tracer::setTracingState(std::move(tracer_state));
    jit::tracer::addOutput(node, result);
  }
  #endif
  return result;
}
```

Example of new code:
```
inline at::Tensor arange(at::Scalar start, at::Scalar end, at::Scalar step, const at::TensorOptions & options = {}) {
  at::Tensor tensor = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::arange(start, end, step, at::TensorOptions(options));
  })();
  at::Tensor result =
    autograd::make_variable(std::move(tensor), /*requires_grad=*/options.requires_grad());
  return result;
}
```
ghstack-source-id: 105407617

Test Plan: CI

Differential Revision: D21880936

fbshipit-source-id: 19a4330eed5bc1ee956ad1c638a9658e7a1ce283
2020-06-07 00:17:48 -07:00
9db27a50b4 [pytorch] add operator name to callBoxed() error message (#39562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39562

Improve error message to help debug missing boxed kernel error.
ghstack-source-id: 105407643

Test Plan: CI

Differential Revision: D21900540

fbshipit-source-id: 3d977bf2a7b886be9b3f940342c9bc5e186479e4
2020-06-07 00:01:25 -07:00
e4627e5dba [quant][graphmode] Fix add_relu patterns for scripting and tracing (#39455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39455

1. enable filters in PatternInfo
2. add aten_add_alpha_is_one filter
3. add is_functional_relu filter
4. add is_relu_module filter
5. fix the relu module method call matching in traced modules with regex
6. add aten::add - aten::relu patterns for traced modules

Test Plan: Imported from OSS

Differential Revision: D21917118

fbshipit-source-id: e67b55cd1c070fd4238f563d933a6f10a3582ae3
2020-06-06 23:51:34 -07:00
2da5444221 [Resubmit] Fix argmin/max bug (#39576)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38922

See previous PR: https://github.com/pytorch/pytorch/pull/38946

cc: ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39576

Differential Revision: D21906490

Pulled By: ngimel

fbshipit-source-id: f3bfb4e14c4cee60a1e3b80c049945ce85f9f494
2020-06-06 23:47:12 -07:00
644d6a09e6 add overload name for aten::as_tensor (#39610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39610

add overload name for aten::as_tensor

two of the `aten::as_tensor` overloads lack an overload name:

```
aten::as_tensor.float(float t, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::as_tensor.int(int t, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::as_tensor.bool(bool t, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::as_tensor(t[] data, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::as_tensor(Tensor(a) data, *, int? dtype=None, Device? device=None) -> (Tensor(b|a))

```

This change renames the list variant to `aten::as_tensor.list`.

Test Plan:
verified no duplicated op name after this diff

This is the full list:

```
prim::TupleUnpack(Any tup) -> (...)
prim::unchecked_cast(t x) -> (t)
aten::IntImplicit(Tensor a) -> (int)
aten::FloatImplicit(Tensor a) -> (float)
aten::ScalarImplicit(Tensor a) -> (Scalar)
aten::Bool.Tensor(Tensor a) -> (bool)
aten::Bool.int(int a) -> (bool)
aten::Bool.float(float a) -> (bool)
aten::Float.Tensor(Tensor a) -> (float)
aten::Float.Scalar(Scalar a) -> (float)
aten::Float.int(int a) -> (float)
aten::Float.bool(bool a) -> (float)
aten::Float.str(str a) -> (float)
aten::format(str self, ...) -> (str)
prim::NumToTensor.Scalar(Scalar a) -> (Tensor)
prim::RaiseException(str msg) -> ()
aten::Size(int[] sizes) -> (int[])
aten::size(Tensor self) -> (int[])
prim::TupleIndex(Any tup, int i) -> (Any)
aten::ne.int_list(int[] a, int[] b) -> (bool)
prim::unchecked_unwrap_optional(t(a)? optional) -> (t(a))
prim::device(Tensor a) -> (Device)
prim::dtype(Tensor a) -> (int)
aten::__not__(bool self) -> (bool)
aten::__is__(t1 self, t2 obj) -> (bool)
aten::__isnot__(t1 self, t2 obj) -> (bool)
aten::element_size(Tensor self) -> (int)
aten::numel(Tensor self) -> (int)
aten::dim(Tensor self) -> (int)
aten::get_device(Tensor self) -> (int)
aten::storage_offset(Tensor self) -> (int)
aten::is_contiguous(Tensor self) -> (bool)
aten::select.t(t[](a) list, int idx) -> (t(*))
aten::__getitem__.t(t[](a) list, int idx) -> (t(*))
aten::append.t(t[](a!) self, t(c -> *) el) -> (t[](a!))
aten::reverse.t(t[](a!) self) -> ()
aten::extend.t(t[](a!) self, t[] other) -> ()
aten::copy.t(t[](a) self) -> (t[])
aten::_set_item.t(t[](a!) l, int idx, t(b -> *) el) -> (t[](a!))
aten::clear.t(t[](a!) self) -> ()
aten::Delete.t(t[](a!) self, int idx) -> ()
aten::insert.t(t[](a!) self, int idx, t(b -> *) el) -> ()
aten::pop.t(t[](a!) self, int idx=-1) -> (t(*))
aten::add.t(t[] a, t[] b) -> (t[])
aten::add_.t(t[](a!) self, t[] b) -> (t[])
aten::slice.t(t[] l, int start, int end=9223372036854775807, int step=1) -> (t[])
aten::list.t(t[] l) -> (t[])
aten::mul.left_t(t[] l, int n) -> (t[])
aten::mul.right_(int n, t[] l) -> (t[])
aten::mul_.t(t[](a!) l, int n) -> (t[](a!))
aten::len.t(t[] a) -> (int)
aten::eq.int_list(int[] a, int[] b) -> (bool)
prim::Uninitialized() -> (Any)
prim::Print(...) -> ()
aten::eq.int(int a, int b) -> (bool)
aten::eq.float(float a, float b) -> (bool)
aten::eq.int_float(int a, float b) -> (bool)
aten::eq.float_int(float a, int b) -> (bool)
aten::eq(Scalar a, Scalar b) -> (bool)
aten::eq.str(str a, str b) -> (bool)
aten::ne.int(int a, int b) -> (bool)
aten::ne.float(float a, float b) -> (bool)
aten::ne.int_float(int a, float b) -> (bool)
aten::ne.float_int(float a, int b) -> (bool)
aten::ne(Scalar a, Scalar b) -> (bool)
aten::ne.str(str a, str b) -> (bool)
aten::lt.int(int a, int b) -> (bool)
aten::lt.float(float a, float b) -> (bool)
aten::lt.int_float(int a, float b) -> (bool)
aten::lt.float_int(float a, int b) -> (bool)
aten::lt(Scalar a, Scalar b) -> (bool)
aten::lt.str(str a, str b) -> (bool)
aten::gt.int(int a, int b) -> (bool)
aten::gt.float(float a, float b) -> (bool)
aten::gt.int_float(int a, float b) -> (bool)
aten::gt.float_int(float a, int b) -> (bool)
aten::gt(Scalar a, Scalar b) -> (bool)
aten::gt.str(str a, str b) -> (bool)
aten::le.int(int a, int b) -> (bool)
aten::le.float(float a, float b) -> (bool)
aten::le.int_float(int a, float b) -> (bool)
aten::le.float_int(float a, int b) -> (bool)
aten::le(Scalar a, Scalar b) -> (bool)
aten::le.str(str a, str b) -> (bool)
aten::ge.int(int a, int b) -> (bool)
aten::ge.float(float a, float b) -> (bool)
aten::ge.int_float(int a, float b) -> (bool)
aten::ge.float_int(float a, int b) -> (bool)
aten::ge(Scalar a, Scalar b) -> (bool)
aten::ge.str(str a, str b) -> (bool)
aten::add.int(int a, int b) -> (int)
aten::add.float(float a, float b) -> (float)
aten::add.int_float(int a, float b) -> (float)
aten::add.float_int(float a, int b) -> (float)
aten::add(Scalar a, Scalar b) -> (Scalar)
aten::sub.int(int a, int b) -> (int)
aten::sub.float(float a, float b) -> (float)
aten::sub.int_float(int a, float b) -> (float)
aten::sub.float_int(float a, int b) -> (float)
aten::sub(Scalar a, Scalar b) -> (Scalar)
aten::mul.int(int a, int b) -> (int)
aten::mul.float(float a, float b) -> (float)
aten::mul.int_float(int a, float b) -> (float)
aten::mul.float_int(float a, int b) -> (float)
aten::mul(Scalar a, Scalar b) -> (Scalar)
aten::__and__(bool a, bool b) -> (bool)
aten::__or__(bool a, bool b) -> (bool)
aten::__xor__(bool a, bool b) -> (bool)
aten::remainder.int(int a, int b) -> (int)
aten::remainder.float(float a, float b) -> (float)
aten::remainder.int_float(int a, float b) -> (float)
aten::remainder.float_int(float a, int b) -> (float)
aten::remainder(Scalar a, Scalar b) -> (Scalar)
aten::div.int(int a, int b) -> (float)
aten::div.float(float a, float b) -> (float)
aten::div(Scalar a, Scalar b) -> (float)
aten::floordiv.int(int a, int b) -> (int)
aten::floordiv.float(float a, float b) -> (float)
aten::floordiv.int_float(int a, float b) -> (float)
aten::floordiv.float_int(float a, int b) -> (float)
aten::floordiv(Scalar a, Scalar b) -> (Scalar)
aten::pow.int(int a, int b) -> (float)
aten::pow.float(float a, float b) -> (float)
aten::pow.int_float(int a, float b) -> (float)
aten::pow.float_int(float a, int b) -> (float)
aten::pow(Scalar a, Scalar b) -> (float)
aten::pow.Scalar(Scalar a, Scalar b) -> (Scalar)
aten::pow.int_to_int(int a, int b) -> (int)
prim::min.int(int a, int b) -> (int)
prim::min.float(float a, float b) -> (float)
prim::min.int_float(int a, float b) -> (float)
prim::min.float_int(float a, int b) -> (float)
prim::min(Scalar a, Scalar b) -> (Scalar)
prim::max.int(int a, int b) -> (int)
prim::max.float(float a, float b) -> (float)
prim::max.int_float(int a, float b) -> (float)
prim::max.float_int(float a, int b) -> (float)
prim::max(Scalar a, Scalar b) -> (Scalar)
prim::type(Device self) -> (str)
aten::len.Tensor(Tensor t) -> (int)
aten::index.Tensor_hacked_twin(Tensor self, Tensor[] indices) -> (Tensor)
aten::_index_put_impl_.hacked_twin(Tensor(a!) self, Tensor[] indices, Tensor values, bool accumulate=False, bool unsafe=False) -> (Tensor(a!))
aten::index_put_.hacked_twin(Tensor(a!) self, Tensor[] indices, Tensor values, bool accumulate=False) -> (Tensor(a!))
aten::index_put.hacked_twin(Tensor self, Tensor[] indices, Tensor values, bool accumulate=False) -> (Tensor)
aten::to.prim_Device(Tensor(a) self, Device? device, int? dtype=None, bool non_blocking=False, bool copy=False) -> (Tensor(b|a))
aten::to.prim_dtype(Tensor(a) self, int? dtype=None, bool non_blocking=False, bool copy=False) -> (Tensor(b|a))
prim::is_cuda(Tensor a) -> (bool)
prim::data(Tensor(a) a) -> (Tensor(a))
prim::min.int_list(int[] l, int[] r) -> (int[])
prim::max.int_list(int[] l, int[] r) -> (int[])
prim::min.self_int(int[] self) -> (int)
prim::max.self_int(int[] self) -> (int)
prim::min.float_list(float[] l, float[] r) -> (float[])
prim::max.float_list(float[] l, float[] r) -> (float[])
prim::min.self_float(float[] self) -> (float)
prim::max.self_float(float[] self) -> (float)
prim::min.bool_list(bool[] l, bool[] r) -> (bool[])
prim::max.bool_list(bool[] l, bool[] r) -> (bool[])
prim::min.self_bool(bool[] self) -> (bool)
prim::max.self_bool(bool[] self) -> (bool)
aten::len.Dict_str(Dict(str, t) self) -> (int)
aten::keys.str(Dict(str, t) self) -> (str[](*))
aten::values.str(Dict(str, t) self) -> (t[](*))
aten::__getitem__.Dict_str(Dict(str, t) self, str key) -> (t(*))
aten::get.str(Dict(str, t) self, str key) -> (t(*)?)
aten::get.default_str(Dict(str, t) self, str key, t default_value) -> (t(*))
aten::setdefault.str(Dict(str, t)(a!) self, str(b -> *) key, t(c -> *) default_value) -> (t(*))
aten::Delete.Dict_str(Dict(str, t)(a!) self, str key) -> ()
aten::pop.Dict_str(Dict(str, t)(a!) self, str key) -> (t(*))
aten::pop.Dict_default_str(Dict(str, t)(a!) self, str key, t default_value) -> (t(*))
aten::popitem.str(Dict(str, t)(a!) self) -> ((str, t))
aten::clear.str(Dict(str, t)(a!) self) -> ()
aten::update.str(Dict(str, t)(a!) self, Dict(str, t)(a!) to_add) -> ()
aten::items.str(Dict(str, t) self) -> ((str, t)[])
aten::copy.Dict_str(Dict(str, t)(a) self) -> (Dict(str, t))
aten::__contains__.str(Dict(str, t) dict, str key) -> (bool)
aten::_set_item.str(Dict(str, t)(a!) l, str(b -> *) idx, t(c -> *) v) -> ()
aten::dict.str((str, tVal)[] inputs) -> (Dict(str, tVal))
aten::len.Dict_int(Dict(int, t) self) -> (int)
aten::keys.int(Dict(int, t) self) -> (int[](*))
aten::values.int(Dict(int, t) self) -> (t[](*))
aten::__getitem__.Dict_int(Dict(int, t) self, int key) -> (t(*))
aten::get.int(Dict(int, t) self, int key) -> (t(*)?)
aten::get.default_int(Dict(int, t) self, int key, t default_value) -> (t(*))
aten::setdefault.int(Dict(int, t)(a!) self, int(b -> *) key, t(c -> *) default_value) -> (t(*))
aten::Delete.Dict_int(Dict(int, t)(a!) self, int key) -> ()
aten::pop.Dict_int(Dict(int, t)(a!) self, int key) -> (t(*))
aten::pop.Dict_default_int(Dict(int, t)(a!) self, int key, t default_value) -> (t(*))
aten::popitem.int(Dict(int, t)(a!) self) -> ((int, t))
aten::clear.int(Dict(int, t)(a!) self) -> ()
aten::update.int(Dict(int, t)(a!) self, Dict(int, t)(a!) to_add) -> ()
aten::items.int(Dict(int, t) self) -> ((int, t)[])
aten::copy.Dict_int(Dict(int, t)(a) self) -> (Dict(int, t))
aten::__contains__.int(Dict(int, t) dict, int key) -> (bool)
aten::_set_item.int(Dict(int, t)(a!) l, int(b -> *) idx, t(c -> *) v) -> ()
aten::dict.int((int, tVal)[] inputs) -> (Dict(int, tVal))
aten::len.Dict_float(Dict(float, t) self) -> (int)
aten::keys.float(Dict(float, t) self) -> (float[](*))
aten::values.float(Dict(float, t) self) -> (t[](*))
aten::__getitem__.Dict_float(Dict(float, t) self, float key) -> (t(*))
aten::get.float(Dict(float, t) self, float key) -> (t(*)?)
aten::get.default_float(Dict(float, t) self, float key, t default_value) -> (t(*))
aten::setdefault.float(Dict(float, t)(a!) self, float(b -> *) key, t(c -> *) default_value) -> (t(*))
aten::Delete.Dict_float(Dict(float, t)(a!) self, float key) -> ()
aten::pop.Dict_float(Dict(float, t)(a!) self, float key) -> (t(*))
aten::pop.Dict_default_float(Dict(float, t)(a!) self, float key, t default_value) -> (t(*))
aten::popitem.float(Dict(float, t)(a!) self) -> ((float, t))
aten::clear.float(Dict(float, t)(a!) self) -> ()
aten::update.float(Dict(float, t)(a!) self, Dict(float, t)(a!) to_add) -> ()
aten::items.float(Dict(float, t) self) -> ((float, t)[])
aten::copy.Dict_float(Dict(float, t)(a) self) -> (Dict(float, t))
aten::__contains__.float(Dict(float, t) dict, float key) -> (bool)
aten::_set_item.float(Dict(float, t)(a!) l, float(b -> *) idx, t(c -> *) v) -> ()
aten::dict.float((float, tVal)[] inputs) -> (Dict(float, tVal))
aten::len.Dict_Tensor(Dict(Tensor, t) self) -> (int)
aten::keys.Tensor(Dict(Tensor, t) self) -> (Tensor[](*))
aten::values.Tensor(Dict(Tensor, t) self) -> (t[](*))
aten::__getitem__.Dict_Tensor(Dict(Tensor, t) self, Tensor key) -> (t(*))
aten::get.Tensor(Dict(Tensor, t) self, Tensor key) -> (t(*)?)
aten::get.default_Tensor(Dict(Tensor, t) self, Tensor key, t default_value) -> (t(*))
aten::setdefault.Tensor(Dict(Tensor, t)(a!) self, Tensor(b -> *) key, t(c -> *) default_value) -> (t(*))
aten::Delete.Dict_Tensor(Dict(Tensor, t)(a!) self, Tensor key) -> ()
aten::pop.Dict_Tensor(Dict(Tensor, t)(a!) self, Tensor key) -> (t(*))
aten::pop.Dict_default_Tensor(Dict(Tensor, t)(a!) self, Tensor key, t default_value) -> (t(*))
aten::popitem.Tensor(Dict(Tensor, t)(a!) self) -> ((Tensor, t))
aten::clear.Tensor(Dict(Tensor, t)(a!) self) -> ()
aten::update.Tensor(Dict(Tensor, t)(a!) self, Dict(Tensor, t)(a!) to_add) -> ()
aten::items.Tensor(Dict(Tensor, t) self) -> ((Tensor, t)[])
aten::copy.Dict_Tensor(Dict(Tensor, t)(a) self) -> (Dict(Tensor, t))
aten::__contains__.Tensor(Dict(Tensor, t) dict, Tensor key) -> (bool)
aten::_set_item.Tensor(Dict(Tensor, t)(a!) l, Tensor(b -> *) idx, t(c -> *) v) -> ()
aten::dict.Tensor((Tensor, tVal)[] inputs) -> (Dict(Tensor, tVal))
aten::split(Tensor self, int[] split_sizes, int dim=0) -> (Tensor[])
aten::tensor.float(float t, *, int? dtype=None, Device? device=None, bool requires_grad=False) -> (Tensor)
aten::as_tensor.float(float t, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::tensor.int(int t, *, int? dtype=None, Device? device=None, bool requires_grad=False) -> (Tensor)
aten::as_tensor.int(int t, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::tensor.bool(bool t, *, int? dtype=None, Device? device=None, bool requires_grad=False) -> (Tensor)
aten::as_tensor.bool(bool t, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::_infer_size(int[] a, int[] b) -> (int[])
aten::_no_grad_embedding_renorm_(Tensor weight, Tensor input, float max_norm, float norm_type) -> (Tensor)
aten::tensor(t[] data, *, int? dtype=None, Device? device=None, bool requires_grad=False) -> (Tensor)
aten::as_tensor(Tensor(a) data, *, int? dtype=None, Device? device=None) -> (Tensor(b|a))
aten::as_tensor.list(t[] data, *, int? dtype=None, Device? device=None) -> (Tensor)
aten::_pack_sequence(Tensor output, Tensor batch_sizes, Tensor? sorted_indices, Tensor? unsorted_indices) -> (Tensor, Tensor, Tensor?, Tensor?)
aten::_get_tracing_state() -> (bool)
aten::is_scripting() -> (bool)
aten::_no_grad_uniform_(Tensor(a!) tensor, float a, float b) -> (Tensor(a!))
aten::_no_grad_normal_(Tensor(a!) tensor, float mean, float std) -> (Tensor(a!))
aten::_no_grad_fill_(Tensor(a!) tensor, float val) -> (Tensor(a!))
aten::_no_grad_zero_(Tensor(a!) tensor) -> (Tensor(a!))

```

Reviewed By: iseeyuan

Differential Revision: D21915144

fbshipit-source-id: 35faac8db03931aebad6089488ef6ca691d230d9
2020-06-06 23:42:18 -07:00
b28422d444 add overload name for str cmp (#39607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39607

add overload name for strcmp macro to prevent duplicated op names in lite interpreter

also reformatted some other files

Test Plan:
verified these op schema are changed

```
-aten::eq(str a, str b) -> (bool)
+aten::eq.str(str a, str b) -> (bool)

-aten::ne(str a, str b) -> (bool)
+aten::ne.str(str a, str b) -> (bool)

-aten::lt(str a, str b) -> (bool)
+aten::lt.str(str a, str b) -> (bool)

-aten::gt(str a, str b) -> (bool)
+aten::gt.str(str a, str b) -> (bool)

-aten::le(str a, str b) -> (bool)
+aten::le.str(str a, str b) -> (bool)

-aten::ge(str a, str b) -> (bool)
+aten::ge.str(str a, str b) -> (bool)
```

Reviewed By: iseeyuan

Differential Revision: D21913049

fbshipit-source-id: 518db068c8c5b0efd19223f0bd94fc3351335dc4
2020-06-06 23:21:35 -07:00
479b04e26a Improve DistributedSampler docs and add seed option (#39628)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39628
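A short sketch of the seed option this adds; the dataset, rank, and world size are placeholders:

```
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(100))  # any sized dataset

# All ranks must pass the same seed so the shuffle stays consistent
# across processes.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True, seed=42)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # deterministic re-shuffle per epoch
    for batch in loader:
        pass
```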

Differential Revision: D21920373

Pulled By: mrshenli

fbshipit-source-id: d7d1005db6feef4a83a1a094b85fcff964bd0ac6
2020-06-06 14:24:22 -07:00
f2af07d7f6 Fix circleci postnightly jobs (#39627)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39626.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39627

Differential Revision: D21920220

Pulled By: seemethere

fbshipit-source-id: 0cd2aa10f01b3f65ca4c330ff8bdf941824b7be3
2020-06-06 10:12:24 -07:00
53c19423cf Update TensorPipe submodule (#39598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39598

In order to include af5f68b241

Test Plan: CircleCI

Reviewed By: mrshenli

Differential Revision: D21910997

fbshipit-source-id: 98ac0a9431576e2984c0cac99cc83f7ba967ccde
2020-06-06 00:36:37 -07:00
6a75f650dd Implement Quantized Version of Threshold Function (#39352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39352

In this task, the quantized backend of the kernel is implemented for the threshold function, which replaces the entries in a tensor that are less than or equal to a given threshold with a specified value.

The corresponding Python implementation and unit test are also added.
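A minimal sketch of the behavior, assuming the Python entry point lands under torch.nn.quantized.functional (values are arbitrary):

```
import torch
import torch.nn.quantized.functional as qF

x = torch.tensor([-1.0, 0.0, 0.5, 2.0])
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=64, dtype=torch.quint8)
qy = qF.threshold(qx, threshold=0.5, value=-1.0)
print(qy.dequantize())  # entries <= 0.5 become (approximately) -1.0
```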

Test Plan:
1. On a devserver, build PyTorch from source by running the command `buck build mode/dev //caffe2:torch`
2. Run the unit test through the command
`buck test mode/dev //caffe2/test:quantization -- test_qthreshold`

Reviewed By: z-a-f

Differential Revision: D21822446

fbshipit-source-id: e8c869664e6d4c664f0e7fa3957762992118c082
2020-06-05 23:07:48 -07:00
3669e45736 [jit][subgraph_matcher] Enable regex matching for string attributes of node (#39454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39454

Test Plan: Imported from OSS

Differential Revision: D21876224

fbshipit-source-id: c0fdff3a4532d2a73b222353e2cad6cf52444697
2020-06-05 23:03:38 -07:00
856215509d [jit] update to serialization doc (#39025)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39025

Test Plan: Imported from OSS

Differential Revision: D21911710

Pulled By: wanchaol

fbshipit-source-id: e3c346feef2ddc36c671d5e1469702854dbfebb3
2020-06-05 17:49:08 -07:00
834569232b [online trainer] Add blob reorder (#39534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39534

Reviewed By: boryiingsu

Differential Revision: D21871352

fbshipit-source-id: 00cce83b7351fdafd36d4db57c99fb8a58e8a260
2020-06-05 17:33:08 -07:00
e29d873e68 disable autograd while preparing Tensor for printing (#39420)
Summary:
Minor speed up when printing.
Also allows you to print Tensors that you cannot perform autograd ops on.
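Conceptually, the change amounts to running the formatting under a no_grad guard, roughly:

```
import torch

x = torch.randn(3, requires_grad=True)
with torch.no_grad():  # printing's reductions no longer touch autograd
    s = str(x)
print(s)
```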
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39420

Differential Revision: D21889390

Pulled By: albanD

fbshipit-source-id: 4e229994eb89484795282e6eac37359ce46b5ebc
2020-06-05 16:57:48 -07:00
e35199a691 observer bench: add CUDA (#39360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360

Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.

Test Plan:
```
python -m pt.qobserver_test
```

Imported from OSS

Differential Revision: D21828985

fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
2020-06-05 14:18:32 -07:00
545a3e1eca Remove test_nccl from ROCM_BLACKLIST and enable only a couple of test_nccl tests (#39354)
Summary:
All individual test_nccl unit tests have been disabled for ROCm in bf9395438f
test_nccl was also added to the ROCM_BLACKLIST in 87b198d309
However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details in comments here: https://github.com/pytorch/pytorch/pull/38689

This PR enables test_nccl suite with only two tests so as to workaround the as-yet unresolved issue above, while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for: https://github.com/pytorch/pytorch/pull/38515
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39354

Differential Revision: D21843194

Pulled By: mrshenli

fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35
2020-06-05 13:52:23 -07:00
97a2918a07 reduce number of bailout nodes (#38281)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38281

Differential Revision: D21665509

Pulled By: Krovatkin

fbshipit-source-id: c2c34b759aec30d0a161e582030ba994192ee4ec
2020-06-05 13:45:37 -07:00
88fe05e106 [Docs] Update torch.(squeeze, split, set_printoptions, save) docs. (#39303)
Summary:
I added the following to the docs:
1. `torch.save`.
    1. Added doc for `_use_new_zipfile_serialization` argument.
    2. Added a note telling that extension does not matter while saving.
    3. Added an example showing the use of above argument along with `pickle_protocol=5`.

2. `torch.split`
    1. Added an example showing the use of the function.

3. `torch.squeeze`
   1. Added a warning for batch_size=1 case.

4. `torch.set_printoptions`
    1. Changed the docs of `sci_mode` argument from
        ```
        sci_mode: Enable (True) or disable (False) scientific notation. If
                 None (default) is specified, the value is defined by `_Formatter`
        ```
        to
        ```
        sci_mode: Enable (True) or disable (False) scientific notation. If
                 None (default=False) is specified, the value is defined by
                `torch._tensor_str._Formatter`.
        ```
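A small sketch of the documented usage; the file name and tensor shapes are arbitrary:

```
import torch

t = torch.arange(8)
# The extension does not matter; the zipfile format is shown explicitly here.
torch.save(t, 'tensor.pt', _use_new_zipfile_serialization=True)

a = torch.arange(10).reshape(5, 2)
print(torch.split(a, 2))       # chunks of up to 2 rows
print(torch.split(a, [1, 4]))  # explicit split sizes along dim 0
```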
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39303

Differential Revision: D21904504

Pulled By: zou3519

fbshipit-source-id: 92a324257d09d6bcfa0b410d4578859782b94488
2020-06-05 12:57:53 -07:00
0031108b60 Support torch.Tensor subclass (like Parameter) input. (#39487)
Summary:
Currently, torch.Tensor subclasses (like torch.nn.Parameter) aren't supported as type annotations for TorchScript inputs. This PR allows them to be treated like torch.Tensor for compilation.
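A sketch of the kind of annotation this enables (hypothetical function, not from the PR):

```
import torch

@torch.jit.script
def scale(w: torch.nn.Parameter, x: torch.Tensor) -> torch.Tensor:
    # Parameter is now accepted and treated as Tensor during compilation
    return w * x

w = torch.nn.Parameter(torch.ones(3))
print(scale(w, torch.arange(3.0)))
```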

Closes https://github.com/pytorch/pytorch/issues/38235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39487

Differential Revision: D21885827

Pulled By: gmagogsfm

fbshipit-source-id: 1ec51829b132b7b0293a6c526d73497b23dae113
2020-06-05 11:58:20 -07:00
a6690bdb5b fix input schema check for spatialbn
Summary:
we were restricting it to 3 inputs, but in training we pass up to 5, even though
in practice we only need 3 since we don't recompute mean/var

Test Plan: contrib tests for fakelowp

Reviewed By: hl475

Differential Revision: D21905490

fbshipit-source-id: 48f61c7ba7d95f19d55d2f65514a517c1514ae88
2020-06-05 10:30:44 -07:00
51504cb8dd Fix IDE hint channels_last & preserve_format (#39120)
Summary:
- Fixing the nit introduced in https://github.com/pytorch/pytorch/issues/38784 .
- [torch.preserve_format] does not show IDE hint either, it would be fixed here as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39120

Differential Revision: D21904575

Pulled By: ezyang

fbshipit-source-id: 80fa1e838e0c444d7b1f2d45e649b51d6c38b54d
2020-06-05 09:48:05 -07:00
77798a45a6 Un-inline Functions.h into Functions.cpp (#39446)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39446

In my unscientific testing, this reduces startup time by 50% on gcc 8.3.
That's a big fucking deal.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21862037

Pulled By: ezyang

fbshipit-source-id: 69fb401956304a97f8f80c48cecdb1cb199ff434
2020-06-05 09:12:34 -07:00
e2a178ca21 Update caffe2 hypothesis_test_util to support hypothesis-5 (#39498)
Summary:
Extracting the forward/backward-compatible `hypothesis` interface update from https://github.com/pytorch/pytorch/pull/39430 into a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39498

Differential Revision: D21900210

Pulled By: malfet

fbshipit-source-id: 75e637cf839f49dc141d37e1686ce45ff4721245
2020-06-05 08:27:50 -07:00
baf6ed0238 Release GIL when deleting users and unforked owners (#39555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39555

This function does not require the GIL, as all OwnerRRef-related
py::object deletion is now guarded by ConcretePyObjectHolder. If
we held the lock here, we could run into a deadlock when other
threads in the RPC thread pool try to acquire the GIL to destruct
Python UDFs or OwnerRRefs.

Test Plan: Imported from OSS

Differential Revision: D21897125

Pulled By: mrshenli

fbshipit-source-id: 96157689df38bc409af57b83248ae73823d1f959
2020-06-05 06:51:53 -07:00
9bfb91b50b Fix possible deadlock in _wait_all_workers (#39535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39535

This is my understanding of what could happen: on workerN (N != 0), `_wait_all_workers_sequence_id_to_states`, which is a `defaultdict`, is accessed twice: once in the body of `_wait_all_workers` (by the "main thread" of workerN) and once in `_set_proceed_shutdown_signal`, called by worker0 through an RPC call. I think the two could race and access `_wait_all_workers_sequence_id_to_states` at the same time, and thus create two separate copies of `WaitAllWorkersStates`. One of those threads would wait on the event of one copy, but the other thread would set the event of the other copy. This led to a deadlock, as the main thread would end up waiting forever.
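A generic sketch of the hazard and the shape of the fix; all names are illustrative, not the actual RPC internals:

```
import threading
from collections import defaultdict

# Both the main thread and the RPC-callback thread must observe the same
# per-sequence state object, or one of them waits on an event nobody sets.
_states = defaultdict(threading.Event)
_states_lock = threading.Lock()

def _get_state(sequence_id):
    with _states_lock:  # serialize first-touch creation of the state
        return _states[sequence_id]
```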
ghstack-source-id: 105283327

Test Plan: I added additional logging in those functions, ran a stress test of the RPC test suite, based on the logs I suspected that this could be the issue, fixed it and re-run the stress test and didn't see the bug anymore. This is admittedly not very convincing evidence, as I may just have been lucky that second time...

Differential Revision: D21889752

fbshipit-source-id: 05ec710bd2930313e1480ae896b4b2f5f503aa17
2020-06-05 02:42:32 -07:00
8a6914ddb2 Add @rpc.functions.async_execution for rpc.remote (#39486)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39486
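A minimal sketch of the pattern this enables; worker setup is elided and the function body trivially completes its future:

```
import torch
import torch.distributed.rpc as rpc

@rpc.functions.async_execution
def async_add(x, y):
    fut = torch.futures.Future()
    fut.set_result(x + y)  # a real use would complete this from a callback
    return fut             # the callee awaits this before fulfilling the RRef

# With RPC initialized:
# rref = rpc.remote("worker1", async_add, args=(torch.ones(2), 1))
# print(rref.to_here())
```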

Test Plan: Imported from OSS

Differential Revision: D21871422

Pulled By: mrshenli

fbshipit-source-id: 3c432b7718a47732b2aee064c554f6bdcc5c95c1
2020-06-04 22:38:35 -07:00
11abb75362 Make @rpc.functions.async_execution processing generic (#39485)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39485

Test Plan: Imported from OSS

Differential Revision: D21871421

Pulled By: mrshenli

fbshipit-source-id: d0e4e82a9098cad364ecbcecff76091155cbda23
2020-06-04 22:38:29 -07:00
fa4ed17183 Explicitly decref in UnpickledPythonCall dtor (#38398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38398

Test Plan: Imported from OSS

Differential Revision: D21550712

Pulled By: mrshenli

fbshipit-source-id: aac4708a5b6f6dc38149f995d11e27c190648859
2020-06-04 22:36:35 -07:00
876b9591dc Refactor unittests for activation functions relu, elu, and sigmoid (#39190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39190

The tests covered previously by test_qrelu, test_qrelu6, test_qsigmoid, and test_qhardsigmoid are now merged into one test to ensure conciseness and reduce redundancy.

The refactoring aims to provide the basis for a more generalizable framework to test quantized activation functions and more in the future.

Test Plan:
1. On a devserver, build PyTorch from source by running the command "buck build mode/dev //caffe2:torch"
 2. Run the merged unit test through the command
"buck test mode/dev //caffe2/test:quantization -- test_qrelu"
"buck test mode/dev //caffe2/test:quantization -- test_qrelu6"
"buck test mode/dev //caffe2/test:quantization -- test_qsigmoid"
"buck test mode/dev //caffe2/test:quantization -- test_qhardsigmoid"

Reviewed By: z-a-f

Differential Revision: D21755690

fbshipit-source-id: ef62b2a50ee1c3b8607746f47fb587561e75ff25
2020-06-04 19:50:36 -07:00
7d56ef27ee Bumps supported file format in anticipate of torch.div changes (#39529)
Summary:
See https://github.com/pytorch/pytorch/pull/38620 for additional context.

When PyTorch begins producing file format 4 with the updated div behavior it's safe for older PyTorch versions to consume it, since file format 4 only prohibits functionality. Bumping the supported file format version now gives PyTorch users on Master some leeway on updating their services that consume vs. produce PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39529

Differential Revision: D21886790

Pulled By: mruberry

fbshipit-source-id: d6098eff06c26f18c3fac5cc85e5db298ba86e27
2020-06-04 19:34:00 -07:00
17aebe909f Added OPERATOR_SCHEMA entries for missing FakeFP16 operators (#39363)
Summary:
Added OPERATOR_SCHEMA entries for missing FakeFP16 operators
Reviewer: hyz
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39363

Differential Revision: D21885816

Pulled By: hyuen

fbshipit-source-id: aa6abd984df40660ab59a37c9898fbac430866da
2020-06-04 19:04:27 -07:00
f94a171e6f [quant][graphmode] Test for another type of ops in insert_observer for if (#39380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39380

Test for inserting observers for if statement for ops that propagate quantization parameters

Test Plan: Imported from OSS

Differential Revision: D21832477

fbshipit-source-id: 6e0b4ce4a89f847af161bb22338525802adb8b41
2020-06-04 17:36:28 -07:00
da8191a9ad Remove useless copy on zip file load (#36362)
Summary:
Instead of copying to a buffer, then setting a tensor's storage with that buffer, create a storage directly from the file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/36362

Pulled By: driazati

Differential Revision: D21889537

fbshipit-source-id: edbd430073c2bbf52332fe7b3b2590e7d936dedf
2020-06-04 16:59:54 -07:00
ed12df64ca misc updates to fake fp16 tests (#39405)
Summary:
Misc updates to the fake FP16 tests.
1. seeding numpy with a random seed
2. test base class changed from unittest.TestCase=>serial.SerializedTestCase
3. Removed the hypothesis_test_util import
Reviewer: Hector Yuen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39405

Test Plan: Fake FP16 test

Differential Revision: D21890212

Pulled By: hyuen

fbshipit-source-id: 25e7e17f118655f32cdd06ea9db3cdac5277e649
2020-06-04 15:22:18 -07:00
2a513a6a2b Do not raise decorator (#39532)
Summary:
s/raise unittest.skip/raise unittest.SkipTest/
As `unittest.skip` is a decorator while `unittest.SkipTest` is an exception
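The distinction, in a self-contained sketch:

```
import sys
import unittest

class Example(unittest.TestCase):
    @unittest.skip("decorator form: unconditionally skips")
    def test_decorated(self):
        pass

    def test_conditional(self):
        if sys.platform == "win32":
            raise unittest.SkipTest("exception form: skip at runtime")
        self.assertTrue(True)
```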
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39532

Differential Revision: D21889152

Pulled By: malfet

fbshipit-source-id: 27a03dbf065a1e2712a63c6c27e156bd13edbbdf
2020-06-04 14:06:19 -07:00
b861daf098 Reduce time spent per guard by comparing TensorType with Tensor (#39098)
Summary:
It mainly avoids the cost of allocating a new TensorType object for each Tensor by comparing the TensorType against the Tensor directly.
Benchmark result before and after this PR: https://gist.github.com/ailzhang/db44d0a1911cae62e0bb794bff33f40a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39098

Differential Revision: D21786678

Pulled By: ailzhang

fbshipit-source-id: 2f61f0ac1dc8c529c45bef4e149be431ff1608b0
2020-06-04 13:50:18 -07:00
8811e4d00d Add/fix typing annotations to some functions (#39075)
Summary:
Add missing typing imports to some jit tests
Add typing annotations to `torch.testing._compare_scalars_internal` and `torch.testing._internal.assertTrue`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39075

Differential Revision: D21882468

Pulled By: malfet

fbshipit-source-id: dd9858eb8e11a38411544cc64daf36fced807d76
2020-06-04 13:40:04 -07:00
da2f8c9f1f deepcopy() of Objects should call __g/setstate__ (#39500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39500

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D21875091

Pulled By: jamesr66a

fbshipit-source-id: 105875dd220a91bc4fcb8fcfb77fab8b626eb6cb
2020-06-04 13:18:00 -07:00
4e5af8d146 [ONNX] Fix type casting for reduce ops (#38829)
Summary:
Fix type casting for reduce ops in ONNX exporter. PyTorch promotes dtypes bool and all integer types to long for these ops.

This fix only covers traced modules where dtype is present
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38829

Reviewed By: hl475

Differential Revision: D21833533

Pulled By: houseroad

fbshipit-source-id: 00d9ff692cc0b09d6ca169f6c63913f04b56f182
2020-06-04 13:04:09 -07:00
da2004e132 Upgrade lint. (#39483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483

I fixed all of the new errors that occurred because of the upgrade.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21884575

Pulled By: ezyang

fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
2020-06-04 12:56:43 -07:00
fe684679b0 Fix overflow issues when unpacking large numbers (#39140)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/33111

relax the overflow and precision-loss checks when unpacking doubles.

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39140

Differential Revision: D21885217

Pulled By: ezyang

fbshipit-source-id: e2bbe90d719443ea2e1c6b7b2c637f9a943fa5c0
2020-06-04 12:24:24 -07:00
49b69b2ade [JIT] fix broadcasting lists of ints (#39481)
Summary:
Previously, on conversion from Python -> C++ the list was cast to a double list due to a copy-paste error. It's pretty unusual for someone to script a broadcasting list function directly, since it's an internal API, so this was unlikely to affect anyone.

Fix for https://github.com/pytorch/pytorch/issues/39450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39481

Reviewed By: jamesr66a

Differential Revision: D21870557

Pulled By: eellison

fbshipit-source-id: e704e5e87d2702a270b7d65c4df444246a134480
2020-06-04 12:16:41 -07:00
7676aa79ec .circleci: Move binary builds into their own workflow (#39379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39379

Moves binary builds into their own workflow and adds the ability to
target specification on them. This allows you to run the binary build
workflow on a pull request without the need to modify any configuration
at all.

Some notes about this implementation:
* Upload jobs are still restricted to only the nightly branches and RC tags
* Parameters for circleci are currently defined in
  .circleci/verbatim-sources/header-section.yml
* Target specification configuration is currently located at
  .github/pytorch-circleci-labels.yml

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21886341

Pulled By: seemethere

fbshipit-source-id: 146ef5df2fea208d33e97862d52c170bf001bc98
2020-06-04 12:06:23 -07:00
eb5e0376a2 Selective enabling of xnnpack based max_pool2d in ceil_mode. (#39447)
Summary:
max_pool2d with ceil_mode calculates the output size a little differently
than what we get with xnnpack max_pool2d. Thus, when ceil_mode=True, we
disable this path. However, if we get the same output size with and without
ceil_mode, we can still use the xnnpack-based max_pool2d, as shown in the sketch below.
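A quick check of that gate, using the standard pooling output-size formula:

```
import math

def pool_out(size, kernel, stride, pad, ceil_mode):
    a = (size + 2 * pad - kernel) / stride
    return (math.ceil(a) if ceil_mode else math.floor(a)) + 1

# size=8: both formulas give 4, so the xnnpack path is safe with ceil_mode.
print(pool_out(8, 2, 2, 0, True) == pool_out(8, 2, 2, 0, False))  # True
# size=7: they differ (4 vs 3), so the xnnpack path must be skipped.
print(pool_out(7, 2, 2, 0, True), pool_out(7, 2, 2, 0, False))
```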
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39447

Test Plan: CI

Differential Revision: D21873334

Pulled By: kimishpatel

fbshipit-source-id: b84abed1505e36e492cc87e7d40664ac63964909
2020-06-04 11:59:08 -07:00
7680358122 Move some of the definitions in LegacyNNDefinitions.cpp closer to sites (#37531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37531

All of these definitions are no longer "legacy" as their CPU
implementations have been ported to ATen.  There are probably some
layers of indirection that could be reduced here, but for now just do a
minor but unlikely to break things cleanup.

The last thing in LegacyNNDefinitions truly is still in THCUNN and can't
be removed.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21310913

Pulled By: ezyang

fbshipit-source-id: 1ff4ff16abddf13f8d583df990242ac4b0461915
2020-06-04 11:52:03 -07:00
335e4a1e3b Add arccosh, arcsinh and arctanh to unary ops (#38388)
Summary:
This PR aims to add `arccosh`, `arcsinh` and `arctanh` support. Please see issue https://github.com/pytorch/pytorch/issues/38349 for more details.

**TODOs:**

* [x] Add test cases for `arccosh`, `arcsinh` and `arctanh`. (need help)
* [x] Overload ops if `std::op` does not work with `thrust::complex` types (like for `sinh`, `cosh`).

Note: `std::acosh, std::asinh, std::atanh` do not support `thrust::complex` types. Added support for complex types for these 3 ops (`arccosh, arcsinh, arctanh`)
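A quick identity check of these ops; the snippet uses the torch.acosh/asinh/atanh spellings under which these kernels surface in releases, while this PR's text uses the arc* names:

```
import torch

x = torch.tensor([1.5, 2.0, 3.0])      # acosh requires inputs >= 1
print(torch.allclose(torch.cosh(torch.acosh(x)), x))
t = torch.tensor([-0.9, 0.0, 0.5])
print(torch.allclose(torch.sinh(torch.asinh(t)), t))
print(torch.allclose(torch.tanh(torch.atanh(t)), t))
```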

cc: mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38388

Differential Revision: D21882055

Pulled By: mruberry

fbshipit-source-id: d334590b47c5a89e491a002c3e41e6ffa89000e3
2020-06-04 11:40:55 -07:00
ada2652ca6 Restore docs coverage test via sphinx (#39331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39331

Fixes gh-37590

Adds an extra `make coverage` to document building, which uses the built-in facility in sphinx to check docstring coverage. Also fixes a failure to import `torch/jit/supported_ops.py` which broke the [Torchscript Builtins](https://pytorch.org/docs/stable/jit_builtin_functions.html) page.

This also adds the required `SPHINXOPTS` to turn warnings into error, but this is commented out. Note that since documentation of `torchvision` is merged in here, failures there would cause failures here if this is made active. Some thought might be needed about pinning the torchvision version merged into documentation.

The first commit should fail, since the "ScriptModule" class is commented out. I did that in order to check that a CI failure is properly reported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38244

Differential Revision: D21640589

Pulled By: ezyang

fbshipit-source-id: 1e240d81669b5f21404d596de4a27d192dc9fd8a
2020-06-04 10:49:38 -07:00
b4aceb3884 Fix lint (#39527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39527

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21884798

Pulled By: ezyang

fbshipit-source-id: a130bfd4cc122ea1d45e7db7303bf44e04f08703
2020-06-04 10:30:44 -07:00
af91df68ed Remove cuda init patch (#39222)
Summary:
The below lines have been removed from `torch/cuda/__init__.py` anyway:
```
        _cudart = _load_cudart()
        _cudart.cudaGetErrorName.restype = ctypes.c_char_p
        _cudart.cudaGetErrorString.restype = ctypes.c_char_p
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39222

Differential Revision: D21864397

Pulled By: yns88

fbshipit-source-id: 941b13f92192f930e1dfa4b385e1aec2e321e75f
2020-06-04 09:31:34 -07:00
ac25267753 fix build table for ppc64le (#39475)
Summary:
This corrects the build info for ppc64le in the main README.

I am opening this PR before renaming the build job.  (So, the "live" master README has the correct "live" link and the PR does not.)
Immediately after submitting the PR, I will correct the name of the build job.  This will make the new PR link correct, and the current "master" link will briefly appear broken until this PR gets merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39475

Differential Revision: D21883184

Pulled By: malfet

fbshipit-source-id: 148353b632448c98e5aff560d31642328afe7963
2020-06-04 08:31:38 -07:00
002b19da92 Add SymbolicShape and replace all uses of VaryingShape<ShapeSymbol> with it (#38544)
Summary:
Adding a SymbolicShape class to represent a generic tensor shape with ShapeSymbols.

Its core data structure is c10::optional<std::vector<ShapeSymbol>>. If has_value() == false, it represents an unranked tensor shape. At any dimension, a ShapeSymbol can represent a dynamic size, checkable with the ShapeSymbol::IsStatic method.

SymbolicShape now replaces all uses of VaryingShape<ShapeSymbol>, i.e. c10::optional<std::vector<c10::optional<ShapeSymbol>>>. The inner c10::optional wrapper around ShapeSymbol was used to indicate a dynamic size, which overlaps with part of ShapeSymbol's own representation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38544

Reviewed By: ZolotukhinM

Differential Revision: D21693984

Pulled By: gmagogsfm

fbshipit-source-id: 6e633e4f36cf570d6fb34ac15d00ec1fb2054a09
2020-06-04 06:37:39 -07:00
11a60b9942 Clean up thrust::complex from rsqrt (#39294)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39294

Differential Revision: D21818288

Pulled By: anjali411

fbshipit-source-id: ee7758872700a93713ab66565e2a7a9e8a088a94
2020-06-04 06:09:23 -07:00
92c6776761 Fix lint (#39517)
Summary:
Fixes lint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39517

Reviewed By: lw

Differential Revision: D21881495

Pulled By: mruberry

fbshipit-source-id: 43b06466d9311d16b0d78d58ed124c1f01807443
2020-06-04 04:57:34 -07:00
8b2bb02e09 Implement timeout support for RRefs (#38590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:

- Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call).
- Error handling semantics are added to RRef to deal with timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorTypes` (currently just timeout and unknown)
- A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
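A single-process sketch of the new knobs; the address/port values are placeholders for a real multi-worker setup:

```
import os
import torch
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker0", rank=0, world_size=1)

rref = rpc.remote("worker0", torch.add, args=(torch.ones(2), 1), timeout=10)
print(rref.to_here(timeout=10))  # a timeout error, if any, surfaces here
rpc.shutdown()
```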
ghstack-source-id: 105232837

Test Plan: CI

Differential Revision: D21588165

fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
2020-06-04 02:14:42 -07:00
72b0447f8d [pytorch] move tracing logic to a separate dispatch backend (#38467)
Summary:
This PR moves tracing logic out of the generated VariableType kernels, to associate it with a new dedicated dispatch key Tracer.
It also toggles the dispatch key set at various places to keep the semantics unchanged - see the inline [Tracing Mode Switches] note.

Sample generated code:
```
Tensor & __ilshift___Tensor(Tensor & self, const Tensor & other) {
  #if !defined(PYTORCH_DISABLE_TRACING)
  torch::jit::Node* node = nullptr;
  std::shared_ptr<jit::tracer::TracingState> tracer_state;
  if (jit::tracer::isTracing()) {
    tracer_state = jit::tracer::getTracingState();
    at::Symbol op_name;
    op_name = jit::Symbol::fromQualString("aten::__ilshift__");
    node = tracer_state->graph->create(op_name, /*num_outputs=*/0);
    jit::tracer::recordSourceLocation(node);
    jit::tracer::addInputs(node, "self", self);
    jit::tracer::addInputs(node, "other", other);
    tracer_state->graph->insertNode(node);

    jit::tracer::setTracingState(nullptr);
  }
  #endif
  static auto op = c10::Dispatcher::singleton().findSchemaOrThrow("aten::__ilshift__", "Tensor");
  c10::Dispatcher::singleton().redispatch<Tensor &, Tensor &, const Tensor &>(op, c10::DispatchKey::Tracer, self, other);
  #if !defined(PYTORCH_DISABLE_TRACING)
  if (tracer_state) {
    jit::tracer::setTracingState(std::move(tracer_state));
    jit::tracer::addOutput(node, self);
  }
  #endif
  return self;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38467

ghstack-source-id: 105215150

Test Plan: CI

Differential Revision: D21570684

fbshipit-source-id: 1a96761830307f9a934f38bfb9fe8b5b1763e0e0
2020-06-04 01:51:30 -07:00
03eca384fd Optimize GroupNorm on CPU (#28203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28203

Optimize GroupNorm on CPU
ghstack-source-id: 105149765

Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm"

Reviewed By: houseroad

Differential Revision: D17901506

fbshipit-source-id: 5eb22ad0e8a9ab2533282b967b2818f690e48865
2020-06-03 23:52:16 -07:00
4a0a38c17a Revert D21652452: [pytorch][PR] Fix for num_threads==1 in OpenMP "parallel for"
Test Plan: revert-hammer

Differential Revision:
D21652452

Original commit changeset: 2cda7777c0ea

fbshipit-source-id: fdd9a0346ce32a962766f57e13357dd2bc60d8b8
2020-06-03 22:51:51 -07:00
67cea74dd3 Add rpc.async_function decorator for TorchScript functions (#39267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39267

When combined with `torch.jit.script`, the order of decorators matter.
`rpc.functions.async_execution` must be the outmost one. The
`async_execution` decorator will store the TorchScript function in
attribute `_wrapped_async_rpc_function` on the wrapper function, and
pass this wrapped TorchScript function (i.e., an instance of
`torch.jit.ScriptFunction`) to RPC. The caller will mark the ScriptCall
with `isAsyncExecution=true`, and the callee will extract the returned
`Future` in C++ and install subsequent processing as a callback to
that `Future`.
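A sketch of that ordering; a live RPC agent and peer worker are assumed in order to actually invoke it:

```
import torch
from torch import Tensor
from torch.futures import Future
import torch.distributed.rpc as rpc

@rpc.functions.async_execution  # must be the outermost decorator
@torch.jit.script
def script_async_add(to: str, x: Tensor, y: Tensor) -> Future[Tensor]:
    return rpc.rpc_async(to, torch.add, (x, y))
```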

Test Plan: Imported from OSS

Differential Revision: D21792688

fbshipit-source-id: de095eb148d21e9114a478e9e6047c707d34fd07
2020-06-03 22:27:15 -07:00
0829cadca3 Implement rad2deg, deg2rad (#38852)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/38372.
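A quick round-trip of the new ops:

```
import torch

deg = torch.tensor([0.0, 90.0, 180.0])
rad = torch.deg2rad(deg)
print(rad)                 # ~[0.0000, 1.5708, 3.1416]
print(torch.rad2deg(rad))  # back to [0., 90., 180.]
```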

cc mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38852

Differential Revision: D21868935

Pulled By: mruberry

fbshipit-source-id: ae6ded11b743c9d1cdc032984b4abe0a115290d6
2020-06-03 22:21:54 -07:00
4d597cb794 [ONNX] Update pytoch/onnx doc (#39480)
Summary:
Updated docs for operator_export_types and recently added op symbolics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39480

Reviewed By: hl475

Differential Revision: D21877364

Pulled By: houseroad

fbshipit-source-id: 9831fe5776629da897db6d7943f830528cb916d2
2020-06-03 22:15:30 -07:00
cc991bbf19 fix internal targets for layernorm (#39501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39501

fix internal targets, and disable the test until it is fixed

Test Plan:
built and ran the test, but venkat has to get access to nnpi before
fine tuning the last few pieces. Currently getting around 1e-5 relative error

Reviewed By: yinghai

Differential Revision: D21875657

fbshipit-source-id: 3ae762093084fa65b9aeedaef1b2ca1b1e13b587
2020-06-03 22:09:16 -07:00
2f7f47eba1 [ONNX]Enable tests in test_operators.py (#39431)
Summary:
Enable Dropout and SoftmaxCrossEntropy tests in test_operators.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39431

Reviewed By: hl475

Differential Revision: D21877501

Pulled By: houseroad

fbshipit-source-id: 1e9b1e5cf80dc1843bdbde2662f3339e357c6654
2020-06-03 21:49:19 -07:00
0102bbf01e move to.prim_dtype to lite interpreter (#39456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39456

Move aten::to.prim_dtype from full jit to lite interpreter

Test Plan: verify TTS model can be used

Reviewed By: iseeyuan

Differential Revision: D21856104

fbshipit-source-id: 774981a5c04798e3a87cf7d6e6682f35e604944e
2020-06-03 19:24:24 -07:00
4d880c0693 Device and torch._C function cleanup (#38173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38173

- Introduce torch.types.Device representing all "device-like" types
- Stubbed torch.device.__reduce__
- Stubbed all torch._C functions comprehensively
- Deleted _safe_call which is unused throughout the codebase
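A sketch of what torch.types.Device allows in annotations (the helper function is hypothetical):

```
import torch
from torch.types import Device

def move(t: torch.Tensor, device: Device) -> torch.Tensor:
    # Device covers torch.device, str, and int device indices alike
    return t.to(device)

move(torch.ones(2), "cpu")
move(torch.ones(2), torch.device("cpu"))
```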

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21497399

Pulled By: ezyang

fbshipit-source-id: 1f534442b0ec9a70d556545d072f2c06a08b9d15
2020-06-03 19:17:22 -07:00
4f7c7e2e76 [caffe2] compute r_correction only for radam to avoid sqrt(negative) (#39393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39393

Computing r_correction should be done only for radam. Otherwise it can generate floating-point exceptions.

Test Plan:
buck test caffe2/caffe2/python/operator_test:adam_test -- test_sparse_adam
with --caffe2_operator_throw_if_fp_exceptions=1 gflags option

Differential Revision: D21834296

fbshipit-source-id: a9e6a93451423e76a99f6591d21cb65d4374b008
2020-06-03 19:09:28 -07:00
adc13432fe Enabling lite interpreter in torch python API (#39181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39181

Create a Python binding class torch._C.LiteScriptModule for mobile::module; a Python class called LiteScriptModule is created which wraps torch._C.LiteScriptModule.
The Python class LiteScriptModule contains preliminary functions including forward, run_method and __call__.

Create a Python API "load_for_lite_interpreter" under torch.jit.mobile, which takes a pre-saved mobile module in a file-like object as input and returns the Python class LiteScriptModule.

Add a Python binding method "_save_to_buffer_for_mobile" under ScriptModule, and a Python method "_save_to_buffer_for_lite_interpreter" under RecursiveScriptModule, which saves the mobile module into a buffer instead of a file.
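A round-trip sketch using the API names as described in this commit; later releases renamed some of them, so treat these spellings as stated here rather than guaranteed:

```
import io
import torch
from torch.jit.mobile import load_for_lite_interpreter

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

scripted = torch.jit.script(M())
buf = io.BytesIO(scripted._save_to_buffer_for_lite_interpreter())
lite = load_for_lite_interpreter(buf)
print(lite.forward(torch.ones(2)))
```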
ghstack-source-id: 105215736

Test Plan: buck test caffe2/test:mobile

Differential Revision: D21757474

fbshipit-source-id: 758b87497d65c4686459a567d41887c7a577aa4c
2020-06-03 18:33:23 -07:00
3370c045ae Remove copy_imag and copy_real methods (#39065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39065

Test Plan: Imported from OSS

Differential Revision: D21803939

Pulled By: anjali411

fbshipit-source-id: c7313c527eb6b54d49ef46aa0a839a3418fa8d7e
2020-06-03 18:22:50 -07:00
5b23f56d5a Selective build on Training, query based. (#39452)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39452

Selective build works on training.
* VariableType_?.cpp are now selectively generated based on the operator list.
* Add a flag, "train", in pt_operator_library. If it's True, an extra label, "pt_train_operator_library", will be added to the labels. A query for "pt_train_operator_library" is then done to aggregate the training operators. With this flag we limit the generated VariableType to the training operators actually used, to conserve code size. Models that are inference-only have train = False by default.
* For testing purposes, caffe2/fb/pytorch_trainer is created. It's based on full jit but the operators are selectively built.
* smartkeyboard_debug_model is used for testing. Since static code analysis is not yet applied for VariableType, the operators are manually added based on debugging error messages.
* At the build stage, make selective build optional for the training code-gen library.
The reason is that for fb4a to build, the generated VariableType.cpp needs to depend on torch_mobile_train. Torch_mobile_train is not needed for apps that do inference only. In those cases training can be turned off to remove the dependency on torch_mobile_train and save size. It can also be used as a switch to check size regressions introduced by training.
ghstack-source-id: 105190037

(Note: this ignores all push blocking failures!)

Test Plan:
Training:
```
buck run -c pt.build_from_deps_query=1 -c pt.selective_build=0 -c pt.static_dispatch=0 xplat/caffe2/fb/pytorch_trainer:trainer ~/models/papaya/keyboard/smartkeyboard_debug_model.pt
```

Inference, with and without the new query-based feature:
```
buck run -c pt.build_from_deps_query=1 -c pt.selective_build=0 -c pt.static_dispatch=0 xplat/caffe2/fb/lite_predictor:lite_predictor_bi -- --model=/home/myuan/models/pytext/BI/bi_pytext_0512.bc --input_dims "1,4" --input_type int64 --pytext_len=4
```
```
buck run xplat/caffe2/fb/lite_predictor:lite_predictor_bi -- --model=/home/myuan/models/pytext/BI/bi_pytext_0512.bc --input_dims "1,4" --input_type int64 --pytext_len=4
```

Reviewed By: ljk53

Differential Revision: D21459302

fbshipit-source-id: df71a46d74f8c7448cbf51990804104f1384594f
2020-06-03 18:01:48 -07:00
d137710a64 LayerNorm Fake FP16 Op debug (#39476)
Summary:
LayerNorm Fake FP16 Op debug.
still seeing output mismatches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39476

Differential Revision: D21871748

Pulled By: hyuen

fbshipit-source-id: ab308e3acff9ce21de41b0f006cbee767983f8e4
2020-06-03 17:35:25 -07:00
c0d3d2f60f Retry/skip test on URLError rather than on HTTPError (#39477)
Summary:
`HTTPError` is raised when the server is overloaded, while `URLError` is
raised when the network is not available.
Since `HTTPError` is a subclass of `URLError`, catching `URLError` handles both exceptions.
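A minimal sketch of the retry/skip pattern this enables (the `download_url` helper is hypothetical):
```
import unittest
from urllib.error import URLError

def fetch_or_skip(url):
    try:
        return download_url(url)  # hypothetical download helper
    except URLError as e:
        # URLError also catches HTTPError, its subclass, so both an
        # overloaded server and a missing network lead to a skip.
        raise unittest.SkipTest(f"could not fetch {url}: {e}")
```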
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39477

Differential Revision: D21873560

Pulled By: malfet

fbshipit-source-id: 11806671b768705465f562087521ad4887fd20f7
2020-06-03 17:29:40 -07:00
cb530fcd3c Enable some test cases in test_memory_format_operators (#38648)
Summary:
Re-enable some test cases in `test_memory_format_operators` since their corresponding issue has been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38648

Differential Revision: D21689085

Pulled By: VitalyFedyunin

fbshipit-source-id: 0aa09e0bf31ba98c8ad0191ac3afd31dda0f1d42
2020-06-03 16:02:49 -07:00
9ed5efda47 Adds TestCase.compare_with_numpy (#39179)
Summary:
Cut from https://github.com/pytorch/pytorch/pull/38994.

This is a helper function for comparing torch and NumPy behavior. It updates the existing and increasingly popular _np_compare function and moves it to be a method on TestCase.
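A free-standing sketch of the comparison pattern such a helper encapsulates (the real TestCase method's signature may differ):
```
import numpy as np
import torch

def compare_with_numpy(torch_fn, np_fn, tensor, rtol=1e-5, atol=1e-8):
    # Run both implementations on the same data and check agreement.
    expected = np_fn(tensor.cpu().numpy())
    actual = torch_fn(tensor).cpu().numpy()
    np.testing.assert_allclose(actual, expected, rtol=rtol, atol=atol)

compare_with_numpy(torch.exp, np.exp, torch.randn(10))
```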
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39179

Differential Revision: D21855082

Pulled By: mruberry

fbshipit-source-id: edca3b78ae392d32243b02bf61960898b6ba590f
2020-06-03 15:27:32 -07:00
d31e84497c [TensorExpr] some cleanups / fixes for LoopOptions (#39408)
Summary:
Mainly, fix a bug in the HashProvider where it would not include LoopOptions in the hash, meaning two loops would be seen as identical even if they were bound to different thread/block axes. Also added symbolic names for the different axis options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39408

Differential Revision: D21864494

Pulled By: nickgg

fbshipit-source-id: 9c28729984e7a3375e026c78294c9f75b9015123
2020-06-03 15:11:59 -07:00
e4657fe194 Revert D21579607: [pytorch][PR] Do not call optimizations within freezing API
Test Plan: revert-hammer

Differential Revision:
D21579607

Original commit changeset: a6231754fea8

fbshipit-source-id: 277011605eedee1c3b44fbaf877233b239adf56b
2020-06-03 14:50:45 -07:00
2ed4ed8733 [TensorExpr] Fix two bugs in Rfactor (#39268)
Summary:
The two bugs were:
* Non-reduction axes were not added when inserting the new ReduceOp, meaning if a reduction with non-reduce axes was rfactored we'd produce bad outputs. There were no tests of Rfactor with non-reduce axis so I modified a test to do this.
* The new statements were always prepended to the block, meaning writes to a buffer could be reordered after the usage of that buffer. This mostly happened in the case where we rfactor a previously rfactored reduction. There was a test of this, but since it only tested rfactoring the outer reduction axis, there were never any other statements at the insertion point (the tests of the insertion point argument also do this). I added a new test which covers various rfactor-axis cases.

Also cleaned up tests, removed some helper code we don't need etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39268

Differential Revision: D21864489

Pulled By: nickgg

fbshipit-source-id: d314d20997a8472ec96b72f7a9068d6da6d2399c
2020-06-03 14:38:34 -07:00
dbec0febd2 Update key_padding_mask arg docs in MHA module (#39321)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39321

Reviewed By: zhangguanheng66

Differential Revision: D21825488

Pulled By: Nayef211

fbshipit-source-id: 41ee09e683c4ae838cfd488a342088d591e806e4
2020-06-03 13:49:01 -07:00
5cfd1a190e Do not call optimizations within freezing API (#38499)
Summary:
This patch removes call to run optimizations within freezing API.
Only dead code elimination is invoked to clean up the frozen module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38499

Reviewed By: eellison

Differential Revision: D21579607

Pulled By: bzinodev

fbshipit-source-id: a6231754fea89296a3dcf07b5e37a1c43cb8d5dd
2020-06-03 13:25:24 -07:00
46447045ea Replace torch.allclose with self.assertEqual (#39424)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39424

Reviewed By: Krovatkin

Differential Revision: D21854870

Pulled By: ailzhang

fbshipit-source-id: eb68f1775596e4c963169033444d6d6f4f818d4f
2020-06-03 12:40:50 -07:00
5d2cfb3d4c [torch] remove integer conversion resulted in a change of sign warning (#38968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38968

As title

Reviewed By: glaringlee

Differential Revision: D21711684

fbshipit-source-id: c340360b29849fe9ab0e7be376918c92ba3629be
2020-06-03 12:38:18 -07:00
ec5d579929 .github: Add initial target specifier config (#39378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39378

Will initially only contain a label to trigger builds for binary tests

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21864091

Pulled By: seemethere

fbshipit-source-id: f69467ccc797b6b320dc8b7f2d50a8601c172a1f
2020-06-03 11:23:07 -07:00
21ba3b4f40 Fix torch.backends.cudnn mypy error (#38947)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38410

![image](https://user-images.githubusercontent.com/6421097/82724121-74b26880-9c99-11ea-9b63-e92de2dccdf2.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38947

Differential Revision: D21765290

Pulled By: ezyang

fbshipit-source-id: 5d2b25f039a653c609d60cdaac4a7ac5812ae291
2020-06-03 10:55:43 -07:00
6a60a8c1da add_observer: respect device affinity for ReLU (#39337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39337

In #39031 we made fake quantize respect device affinity of the
original module. However, that PR only handled modules with parameters
or buffers, and did not work properly for `ReLU`.

Fixing the logic to also work for `ReLU` by passing the parent's
device when adding observers.

Test Plan:
```
python test/test_quantization.py TestDistributed.test_device_affinity
```

Imported from OSS

Differential Revision: D21821243

fbshipit-source-id: cc6abda3694b80ce8ba0440dc6c1b5b58f3c0066
2020-06-03 09:31:36 -07:00
884e16b41a as_strided : add size and stride length check (#39301)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39301

Differential Revision: D21849082

Pulled By: gchanan

fbshipit-source-id: 5d30ef10767c4d35c6cb59c5e6a9acbfe0270a40
2020-06-03 09:17:54 -07:00
5beb3b0c53 [TensorPipe] Re-enable dist optimizer tests (#39441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39441

This is the last test suite to be enabled for TensorPipe.
ghstack-source-id: 105166757

Test Plan: Ran the tests, hundreds of times each, in different build modes.

Differential Revision: D21858975

fbshipit-source-id: ee0a7e64b77b4b1974f031207031cc14afb3a8c2
2020-06-03 09:00:52 -07:00
b1dab266f7 [TensorPipe] Re-enable dist autograd tests (#39440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39440

After the RPC tests, re-enable the second test suite: dist autograd.
ghstack-source-id: 105165393

Test Plan: Ran the tests, several times each, in different build configs.

Differential Revision: D21858974

fbshipit-source-id: 409377d564c36fecae51b9e4c776d94187b434a2
2020-06-03 08:59:19 -07:00
aea09f5155 Leak safety in RReLU (#39347)
Summary:
Fixes gh-38966

If `THCTensor_(resizeAs)` fails to allocate, then these `free`s will never be reached. So, instead I use a wrapped tensor to do cleanup automatically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39347

Differential Revision: D21838933

Pulled By: ezyang

fbshipit-source-id: 8c74ecdd720d6712a33ddef6126ea545761a269b
2020-06-03 08:27:58 -07:00
c767d65caf Added FPGA DispatchKey, DeviceType, Backend (#38938)
Summary:
ezyang,

I have added the changes to DispatchKey, DeviceType, Backend to support the out-of-tree FPGA.

cc. tataetae
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38938

Differential Revision: D21748955

Pulled By: ezyang

fbshipit-source-id: fe76d9730818205961430d2a0e00727b5c547b32
2020-06-03 07:28:14 -07:00
3f099879f7 [TensorPipe] Re-enable RPC tests (#39406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39406

For now, just the RPC test (no dist autograd or dist optimizer).

I removed the skipping decorator from all the tests except those that explicitly use the ProcessGroup options.

Includes #39027.
ghstack-source-id: 105159974

Test Plan: Ran the tests several hundred times, in various build modes. Saw some flakes, but at a rate of about 0.1%

Differential Revision: D21716069

fbshipit-source-id: 9d2a99e112049a63745772c18e7a58266ed8e74e
2020-06-03 07:14:30 -07:00
7417b4c66f Fix index overflow in ConvTranspose3d [attempt 2] (#39198)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32866, resubmit of https://github.com/pytorch/pytorch/issues/38970

The memory error in the issue is caused by an int overflow in col2vol. This version, using mixed 32-bit and 64-bit index calculations, lifts the maximum possible indexing range without compromising the performance of ConvTranspose3d, versus a 20-30% regression with pure 64-bit indexing.

This requires that input.numel() <= UINT_MAX, and channels * kernel.numel() <= UINT_MAX otherwise it raises an error. Previously, the code would crash or give incorrect results unless input.numel() * kernel.numel() <= INT_MAX.

Note that the test is a minimised reproducer for the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39198

Differential Revision: D21817836

Pulled By: ezyang

fbshipit-source-id: b9adfe9f9dd00f04435be132966b33ac6b9efbef
2020-06-03 07:06:54 -07:00
a05ef17e46 Add rpc.functions.async_execution decorator for rpc_sync/rpc_async (#39216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39216

The `rpc.functions.async_execution` decorator specifies that the
wrapped function is guaranteed to return a `torch.futures.Future`.
The decorator adds a `_wrapped_async_rpc_function` attribute to
the wrapper function. The caller retrieves this information and
then sets `isAsyncFunction` argument accordingly which is later
added to PythonCall RPC message as a field. On the callee side,
if the PythonCall carries an asynchronous function, it will cast
the function's return value to a jit::PythonFutureWrapper object,
and then install response creation and communication as a callback
on that jit::PythonFutureWrapper.

For applications, this feature is useful when a function needs to
wait for IO or additional signaling. In those cases, marking the
user function as `rpc.functions.async_execution` will prevent it
from blocking one thread on callee for too long.
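
A minimal sketch of the decorator in use, based on the behavior described above (assumes an already-initialized RPC group with workers named "worker1" and "worker2"):
```
import torch
import torch.distributed.rpc as rpc

@rpc.functions.async_execution
def async_add(to, x, y):
    # Returns a Future immediately instead of blocking the callee
    # thread; the response is sent when the Future completes.
    return rpc.rpc_async(to, torch.add, args=(x, y))

# On the caller:
# ret = rpc.rpc_sync("worker1", async_add, args=("worker2", x, y))
```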

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D21779962

fbshipit-source-id: 6b6aa698bf6f91dad6ed2a7ee433df429b59e941
2020-06-02 23:21:25 -07:00
15ad9dd30f [ONNX] Bump up ONNX submodule to a82c6a7010e2e332d8f74ad5b0c726fd47c85376 (#39372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39372

we only bump the submodule in oss to unblock some works

Test Plan: ci

Reviewed By: hl475

Differential Revision: D21830800

fbshipit-source-id: fb4a716992efcd71926f7bba24a7c24422c17e38
2020-06-02 21:08:14 -07:00
abe2be2063 [resubmit] Use TensorMethods.cpp (#39385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39385

see https://github.com/pytorch/pytorch/pull/37639

Test Plan:
https://github.com/pytorch/pytorch/pull/37639

Imported from OSS

Differential Revision: D21833287

fbshipit-source-id: 9928d3f4122903d0de67ad312e349352d5f5c19c
2020-06-02 20:27:51 -07:00
a952f9bb06 Fix for num_threads==1 in OpenMP "parallel for" (#36479)
Summary:
fixes gh-32284

Move the non-parallel stanza out of the parallel context, and use `num_threads` to limit nesting `parallel for`s. The nesting caused a memory leak in the test script in the issue.

This should probably have a test somewhere: are there tests for ParallelOpenMP?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36479

Differential Revision: D21652452

Pulled By: ilia-cher

fbshipit-source-id: 2cda7777c0eafbe268550a82fed306e52fb6eb25
2020-06-02 18:56:13 -07:00
36607c85ee [TensorExpr] eliminate zero length Allocations in IRSimplifier (#38794)
Summary:
If the size of a temporary buffer is reduced to zero via binding of a dynamic variable we still run the alloc, even though it is a no op. It's easy to strip these out during simplification, so the expr:
```
{
  Allocate(x, int, {0});
  // Stuff...
  Free(x);
}
```
becomes
```
{
  // Stuff...
}
```

I am assuming here that if the allocation size is zero then any usage of the buffer is also eliminated, since there's no safe way to refer to a zero-size buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38794

Differential Revision: D21723656

Pulled By: nickgg

fbshipit-source-id: 3eaa8bd8974a13b0a351be04abe2348498b31b02
2020-06-02 18:24:42 -07:00
f166b934ee [JIT] Kill _cast_* operators (#39348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39348

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D21824594

Pulled By: jamesr66a

fbshipit-source-id: 2563a886e3e5dd22d23a2f39f32fb077c3fb1dba
2020-06-02 16:35:37 -07:00
8638df45ae call DoRunWithType on Layernorm (#39409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39409

enable running Layernorm fake op

Test Plan: ran the test, results are incorrect

Reviewed By: amylittleyang

Differential Revision: D21845269

fbshipit-source-id: 114e26e4fea80c0a8ab27501503c3ec0dc2fafb5
2020-06-02 16:27:17 -07:00
ebd4125e7e [JIT] Make torch.unique_consecutive compatible (#39339)
Summary:
A `unique_consecutive` version of https://github.com/pytorch/pytorch/pull/38156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39339

Differential Revision: D21823997

Pulled By: eellison

fbshipit-source-id: d14596a36ba36497e296da5a344e0376cef56f1b
2020-06-02 14:54:29 -07:00
c6720f0d6b nit on functional autograd (#35493)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35493

Test Plan: Imported from OSS

Differential Revision: D21843416

Pulled By: albanD

fbshipit-source-id: af4d017ff4559237dd31e2ccaa1e3a967f7497ba
2020-06-02 14:49:16 -07:00
89c0efb30b Also set CMAKE_C_STANDARD for MSVC (#39304)
Summary:
According to
<https://gitlab.kitware.com/cmake/cmake/-/blob/master/Modules/Compiler/MSVC-C.cmake>,
the option simply has no effect for MSVC as of today. It is better not to impose
such an if condition, as it is a bit misleading (the current code makes it look like we have compatibility issues with MSVC C11 support); it's also better to
leave the judgment of MSVC C support to the CMake devs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39304

Differential Revision: D21846032

Pulled By: malfet

fbshipit-source-id: 962e5721da3d7b9be4117b42bdc35df426b7da7b
2020-06-02 13:59:07 -07:00
71af538e31 Updated assert to remove check on 3rd dim for MHA (#39402)
Summary:
## Description
* Updated the assert statement to remove the check on the 3rd dimension (features) for keys and values in MultiheadAttention / Transformer
* The feature dimensions of keys and values can now be of different sizes (see the example below)
* Refer to https://github.com/pytorch/pytorch/issues/27623
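A small example exercising the relaxed check; with `kdim`/`vdim` set, keys and values may have different feature sizes (shapes are illustrative):
```
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, kdim=6, vdim=4)
q = torch.randn(5, 1, 8)  # (target_len, batch, embed_dim)
k = torch.randn(7, 1, 6)  # key feature dim differs from value's
v = torch.randn(7, 1, 4)
out, attn_weights = mha(q, k, v)
print(out.shape)  # torch.Size([5, 1, 8])
```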
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39402

Reviewed By: zhangguanheng66

Differential Revision: D21841678

Pulled By: Nayef211

fbshipit-source-id: f0c9e5e0f33259ae2abb6bf9e7fb14e3aa9008eb
2020-06-02 13:35:39 -07:00
a864dbb360 Make _C extension a thin C wrapper (#39375)
Summary:
It now depends on just a single `torch_python` library.
The C library does not depend on the standard C++ library, and as a result it closes https://github.com/pytorch/pytorch/issues/36941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39375

Reviewed By: orionr

Differential Revision: D21840645

Pulled By: malfet

fbshipit-source-id: 777c189feee9d6fc686816d92cb9f109b8aac7ca
2020-06-02 13:11:59 -07:00
09bea13981 support flip and rot90 for complex dtype (#37826)
Summary:
Closes https://github.com/pytorch/pytorch/issues/37698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37826

Differential Revision: D21657697

Pulled By: mruberry

fbshipit-source-id: 16a3899d5de280da692a52bd0ce85d5ebe14cc31
2020-06-02 13:03:14 -07:00
58cb369dfa Replace calls to contiguous with contiguous(suggested memory format) (#38433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38433

Wherever applicable, it is better to call contiguous with the appropriate
memory format.
In addition, the output should be allocated with the same memory format as the input
when applicable; otherwise, convert to that format upon returning.
This helps performance in cases where calls to contiguous would otherwise
involve an allocation and a memcpy.
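
A small illustration of the idea on the Python side, assuming a channels-last input:
```
import torch

x = torch.randn(1, 3, 8, 8).to(memory_format=torch.channels_last)

# Calling contiguous() with the input's format keeps the layout and
# avoids an allocation + memcpy back to the default layout.
y = x.contiguous(memory_format=torch.channels_last)
assert y.is_contiguous(memory_format=torch.channels_last)
assert y.data_ptr() == x.data_ptr()  # no copy was made
```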

Test Plan: quantization tests

Reviewed By: vkuzo

Differential Revision: D21559301

fbshipit-source-id: 2ed5de05fb627eef1bf5d76fba0387ba67370007
2020-06-02 12:53:52 -07:00
0d96f26404 Kill THC_logical{Value, Tensor} (#39069)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39069

Differential Revision: D21818505

Pulled By: gchanan

fbshipit-source-id: a86462a8a67720f2aaf079a67eb6d6c30bd8ea17
2020-06-02 12:41:08 -07:00
a6f0051db2 Fix test_get_and_set_timeout for TensorPipe Agent (#39353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39353

This test failed with TSAN since the shortened timeout prevented all
messages from being processed within the timeout during Phase 1 of
wait_all_workers during RPC shutdown. Phase 2 already had a longer timeout, so
we extend this to Phase 1 as well.
ghstack-source-id: 105045926

Test Plan: Ran the test_get_and_set_timeout with TSAN

Differential Revision: D21826783

fbshipit-source-id: 7edfdeb50169b31e997dd36a3fd8eea0e9ae7189
2020-06-02 12:01:11 -07:00
fca928cabf [caffe2] fix test error in video_input_op_test (#39382)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39382

Test Plan: buck test caffe2/caffe2/python/operator_test:video_input_op_test

Reviewed By: dutran

Differential Revision: D21832355

fbshipit-source-id: 47b1b0610b9600437fe1ed317d5af47d624767fb
2020-06-02 11:48:01 -07:00
04ac41fe70 [caffe2] format video_input_op_test.py (#39381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39381

To prepare D21832355

Test Plan: Just formatting

Reviewed By: dutran

Differential Revision: D21832354

fbshipit-source-id: bbf6a1377752adaa115ee2e2a5ba546964e3fd08
2020-06-02 11:46:01 -07:00
c3ddb3f7a4 Add rocm image to circleci docker builder (#39262)
Summary:
CC ezyang sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39262

Differential Revision: D21842412

Pulled By: ezyang

fbshipit-source-id: 00113e16c108b8f3eb92b4d0b93741161259f3ed
2020-06-02 11:40:48 -07:00
8bc5a4939f Add prim::data to lite interpreter (#39335)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39335

Test Plan: Imported from OSS

Reviewed By: huntrui

Differential Revision: D21820362

Pulled By: iseeyuan

fbshipit-source-id: 0dd4d6cf6fe56cdab9c61709c8f52809edfd12f5
2020-06-02 11:35:41 -07:00
cca29f2969 [Onnxifi] Support quantized output in Onnxifi (#39230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39230

Pull Request resolved: https://github.com/pytorch/glow/pull/4555

With this we now support cutting in the middle of the quantized domain for Onnxifi, which allows us to observe intermediate quantized values during Onnxifi. The input still has to be a non-quantized tensor, though; that will be addressed in a follow-up.

Test Plan:
```
buck test  glow/fb/test/numerics:test_fc_nnpi_int8nnpi -- test_quantize
```

Reviewed By: hyuen

Differential Revision: D21783368

fbshipit-source-id: 51001246e9e0357d7ba90bf12279b644f5f30221
2020-06-02 11:29:17 -07:00
35719cdc85 Fix some bugs of argmin/argmax and min/max (#39212)
Summary:
Partial fix of: https://github.com/pytorch/pytorch/issues/39060

There are actually two bugs:
1. `TensorIterator::get_dim_to_split` is asserting on what it shouldn't be.
2. `min_kernel_impl` and `max_kernel_impl` are setting `out_scalar_t` wrongly. `out_scalar_t` is used to compute indices for accumulation buffer, which is only used when the tensor is large enough.

Both are tested in `test_argminmax_large_axis_cuda`, but unfortunately, this test does not run on CI.

This PR makes `test_argminmax_large_axis_cuda` green, but this test is still not run on CI. I suggest keeping https://github.com/pytorch/pytorch/issues/39060 open until we figure out a way to run it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39212

Differential Revision: D21834723

Pulled By: ngimel

fbshipit-source-id: e8272ac8552c3954ac486ba6e4129fedb545031e
2020-06-02 11:24:02 -07:00
11f1014c05 Adding lost extra_repr() and __setstate__() to activation.py (#39084)
Summary:
Add
```python
    def __setstate__(self, state):
        self.__dict__.update(state)
        if not hasattr(self, 'dim'):
            self.dim = None

    def extra_repr(self):
        return 'dim={dim}'.format(dim=self.dim)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39084

Differential Revision: D21825245

Pulled By: albanD

fbshipit-source-id: c790c288a9b23e7f320a912e21397da16bb1fb5a
2020-06-02 10:53:12 -07:00
b5cd3a80bb Return None instead of False, and change bool return to None in type stub (#39324)
Summary:
# What's this

Just a small bug fix related to typing stubs.
I haven't opened an issue. I will do so if needed, but this PR is very small (only a 6-line diff).

## What I encountered

pytorch 1.5.0 with mypy 0.770 behaves oddly. The code is as follows:
```python
import torch

def f() -> int:  # Mypy says: `error: Missing return statement`
    with torch.no_grad():
        return 1
```

No mypy error is expected, but actually mypy 0.770 warns about `Missing return statement`.

## This is because

`mypy >= 0.730` with `--warn-unreachable` says it's unreachable because `torch.no_grad()` may "swallow" the error in the return statement.
http://mypy-lang.blogspot.com/2019/09/mypy-730-released.html

Here is a small "swallowing" example:

```python
from typing import Generator
from contextlib import contextmanager

@contextmanager
def swallow_zerodiv() -> Generator[None, None, None]:
    try:
        yield None
    except ZeroDivisionError:
        pass
    finally:
        pass

def div(a: int, b: int) -> float:  # This function seems to be `(int, int) -> float` but is actually `(int, int) -> Optional[float]`, because `return a / b` may be swallowed
    with swallow_zerodiv():
        return a / b

if __name__ == '__main__':
    result = div(1, 0)
    print(result, type(result))  # None <class 'NoneType'>
```

To suppress this behavior, one can tell mypy that the context manager never swallows exceptions, by returning `Literal[False]` or `None` from the `__exit__` method of the context manager.

# What I did

Return `None` instead of `bool` to tell mypy that "I never swallow your exception".
I chose `None` because `Literal[False]` cannot be used without typing_extensions on `python <= 3.7`.
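
A minimal sketch of the stub-level change (not the exact diff): annotating `__exit__` as returning `None` tells mypy the context manager never suppresses exceptions.
```
# no_grad's type stub, sketched:
class no_grad:
    def __enter__(self) -> None: ...
    def __exit__(self, exc_type, exc_value, traceback) -> None: ...

# With this stub, the original example type-checks:
# def f() -> int:
#     with torch.no_grad():
#         return 1
```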
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39324

Differential Revision: D21833651

Pulled By: albanD

fbshipit-source-id: d5cad2e5e19068bd68dc773e997bf13f7e60f4de
2020-06-02 10:46:44 -07:00
bb0377bb24 Expose torch.futures.Future (#39008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39008

This commit adds a `torch.futures.Future` type and exposes its ctor,
`wait`, `then`, and `set_result` APIs. This type is currently a
wrapper of `c10::ivalue::Future` and mainly used by RPC for now. Later,
we could revamp c10d APIs to return this `Future` type as well. More
utils will be added into `torch.futures` package in followup PRs.
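
A minimal sketch of the exposed APIs (callbacks registered via `then` receive the completed future):
```
import torch

fut = torch.futures.Future()
chained = fut.then(lambda f: f.wait() + 1)  # runs once fut completes

fut.set_result(41)
print(fut.wait())      # 41
print(chained.wait())  # 42
```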

Test Plan: Imported from OSS

Differential Revision: D21723022

Pulled By: mrshenli

fbshipit-source-id: 92e56160544e9bf00d11db3e8347a1b9707882c9
2020-06-02 10:12:56 -07:00
b3fac8af6b Initial support for building on Ampere GPU, CUDA 11, cuDNN 8 (#39277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39277

This PR contains initial changes that makes PyTorch build with Ampere GPU, CUDA 11, and cuDNN 8.
TF32 related features will not be included in this PR.

Test Plan: Imported from OSS

Differential Revision: D21832814

Pulled By: malfet

fbshipit-source-id: 37f9c6827e0c26ae3e303580f666584230832d06
2020-06-02 10:03:42 -07:00
85b3fa031c [WIP] Layernorm Fake FP16 Op. (#39103)
Summary:
hyuen
I have added a few changes to the LayerNorm Fake FP16 op.
Test case: test_layernorm_nnpi_fp16.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39103

Reviewed By: hl475

Differential Revision: D21768096

Pulled By: hyuen

fbshipit-source-id: 9bb7a5f759d783149b599706ff8d285653715f01
2020-06-02 09:54:23 -07:00
30146d7391 More fixes about using Windows API through ctypes (#39376)
Summary:
In ctypes, the representation of `NULL` for a `c_void_p` is `None`.
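A quick demonstration of the pitfall:
```
import ctypes

p = ctypes.c_void_p()  # a NULL void pointer
print(p.value)         # None, not 0

# So a check for a NULL return from a Windows API call declared with
# restype = ctypes.c_void_p must compare against None:
if p.value is None:
    print("got NULL")
```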
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39376

Differential Revision: D21833451

Pulled By: malfet

fbshipit-source-id: 70ec0a805a6c473e946ce9a7566440b6e0cd81ba
2020-06-02 09:42:09 -07:00
e358adb42c [TensorPipe] Acquire lock when adding message to timeout map (#39398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39398

The `timeoutMapMutex_` was only used to guard accesses in the timeout thread, but it should also have been used to guard accesses in the `send` method.

The way I found this bug is rather odd. A test was failing because a timeout of 0.5 seconds was firing when it wasn't supposed to. The test was built with TSAN enabled and the point where we were wasting those 500ms was precisely when accessing the `timeoutMap_` in the `send` method. There is of course no reason it would take so long, so I suspect that either such an access triggered a whole lot of lengthy checks in TSAN or, perhaps, that TSAN was delaying it on purpose because it thought it was smelly and wanted to see whether it could cause a race.
ghstack-source-id: 105088618

Test Plan: The test started passing.

Differential Revision: D21838465

fbshipit-source-id: 02cf2bf1fef2e97da99b9c4e77070fe35d2bcbb0
2020-06-02 09:38:24 -07:00
e142d70383 [TensorPipe] Guess IP addr in separate function (#39397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39397

I said I'd do it in a previous diff, but then I forgot, so here it is.
ghstack-source-id: 105088619

Test Plan: No functional changes

Differential Revision: D21838464

fbshipit-source-id: 74fbe76c7ce879b28c50fd29feecd9f4d71fc44c
2020-06-02 09:37:00 -07:00
de5b8797e6 Remove unboxed only from AMP registrations for cat. (#39156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39156

TensorList is now supported for boxing, so we can remove the
unboxed-only flag from it. I didn't check whether other
operators were incorrectly classified.

Fixes https://github.com/pytorch/pytorch/issues/38958

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21819821

Pulled By: ezyang

fbshipit-source-id: 6dcf91bc196554e1721d2c704f3bf524f069534b
2020-06-02 07:49:02 -07:00
283a3ff16d The exception raised when RandomSampler.replacement is non-boolean should be TypeError (#36547)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36547

Differential Revision: D21818752

Pulled By: ezyang

fbshipit-source-id: 7502a24a0df134c44ac72959ba992777c873f8e9
2020-06-02 06:54:02 -07:00
413f023784 Clean up cast from c10::complex<T> to thrust::complex<T>, and update the workaround CUDA version to <10.2 (#38941)
Summary:
I'm using CUDA 10.1 on Debian buster but I can still experience
compilation issues:

```
/usr/include/thrust/detail/complex/complex.inl(64): error: no suitable conversion function from "const c10::complex<float>" to "float" exists
          detected during:
            instantiation of "thrust::complex<T>::complex(const R &) [with T=float, R=c10::complex<float>]"
/home/hong/xusrc/pytorch/c10/util/complex_type.h(503): here
            instantiation of "T std::abs(const c10::complex<T> &) [with T=float]"
/home/hong/xusrc/pytorch/aten/src/ATen/native/cuda/AbsKernel.cu(17): here
            instantiation of "c10::complex<T> at::native::abs_wrapper(c10::complex<T>) [with T=float]"
/home/hong/xusrc/pytorch/aten/src/ATen/native/cuda/AbsKernel.cu(29): here

/usr/include/thrust/detail/complex/complex.inl(64): error: no suitable conversion function from "const c10::complex<double>" to "double" exists
          detected during:
            instantiation of "thrust::complex<T>::complex(const R &) [with T=double, R=c10::complex<double>]"
/home/hong/xusrc/pytorch/c10/util/complex_type.h(503): here
            instantiation of "T std::abs(const c10::complex<T> &) [with T=double]"
/home/hong/xusrc/pytorch/aten/src/ATen/native/cuda/AbsKernel.cu(17): here
            instantiation of "c10::complex<T> at::native::abs_wrapper(c10::complex<T>) [with T=double]"
/home/hong/xusrc/pytorch/aten/src/ATen/native/cuda/AbsKernel.cu(29): here

2 errors detected in the compilation of "/tmp/hong/tmpxft_00005893_00000000-6_AbsKernel.cpp1.ii".
CMake Error at torch_cuda_generated_AbsKernel.cu.o.Debug.cmake:281 (message):
  Error generating file
  /home/hong/xusrc/pytorch/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/./torch_cuda_generated_AbsKernel.cu.o
```

`nvcc --version`:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38941

Differential Revision: D21818790

Pulled By: ezyang

fbshipit-source-id: a4bfcd8ae701f7c214bea0731c13a5f3587b7a98
2020-06-02 06:47:42 -07:00
68f23d566a [pytorch] Let jit.unused ignore unsupported method signature (#39336)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39336

Test Plan: next diff

Differential Revision: D21814656

fbshipit-source-id: 0bc6bcf668715473553f200a6ffea981abef09a6
2020-06-02 00:16:54 -07:00
f4365cf5ba [JIT] Add support for saving/loading of lowered modules (#38893)
Summary:
**Summary**
This commit adds support for serialization and deserialization of
`ScriptModules` that have been lowered to a specific backend. Nothing
special was required to accomplish this, other than removing some code
in `unpickler.cpp` that guarded against the deserialization of `Any`
type objects. Now that lists and dicts are tagged with their types
during serialization, this check is no longer necessary.

**Test Plan**
This commit adds a unit test for testing that a lowered module still
produces the same results as Python and regular JIT after saving and
loading.

**Fixes**
This pull request fixes part of https://github.com/pytorch/pytorch/issues/37841.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38893

Differential Revision: D21825813

Pulled By: SplitInfinity

fbshipit-source-id: 77a7b84504e0dddf14c89b3ed5dd6b438c086f66
2020-06-01 23:50:52 -07:00
858ab75046 ONNX Export Support for Celu (#38243)
Summary:
Add ONNX export support for torch.nn.CELU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38243

Differential Revision: D21562188

Pulled By: houseroad

fbshipit-source-id: a7056b3127c88e4a96a551ae906440ed8a153e42
2020-06-01 23:26:44 -07:00
ed26e8b0a0 Resubmit [Onnxifi] Generic way of passing output shape/type hints (#39377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39377

The previous diff D21781515 had a compilation error on OSS CI and got reverted.

Test Plan: net runner

Reviewed By: jfix71

Differential Revision: D21832199

fbshipit-source-id: 07c6b6fe3bb18dc4f4ecec82ba9b99028086f55c
2020-06-01 22:51:15 -07:00
f16b04f8b3 [caffe2] Update shape info delimiter (#39275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39275

Replace delimiter '|' with '#' because '|' could appear in the tensor names.

Test Plan:
```
buck test //caffe2/caffe2/fb/opt:shape_info_utils_test
```

AI/AF canary:
https://our.intern.facebook.com/intern/ads/canary/427007822162345576
https://our.intern.facebook.com/intern/ads/canary/427007917016548180

Reviewed By: yinghai

Differential Revision: D21781037

fbshipit-source-id: f83497b12ddf0e7b71d6aed0e20873d52e97fb7f
2020-06-01 22:22:09 -07:00
f29fa06c52 [quant][graphmode][fix] Run preprocess for child module before parent module (#39368)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39368

Test Plan: Imported from OSS

Differential Revision: D21829296

fbshipit-source-id: b7f001b54bb9f018336cff2810fc4efa9008ee3d
2020-06-01 21:43:39 -07:00
625f4e39a7 [quant] Fix fusion pattern for add_relu (#39367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39367

We shouldn't match the `%alpha` argument, since it could be used by multiple functions.

Test Plan: Imported from OSS

Differential Revision: D21829295

fbshipit-source-id: 6daa320a4b56df4e142b8e02e04a3ecb36284d1b
2020-06-01 20:15:13 -07:00
3001facd7a [doc] [distributed] fix typo (#39264)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39264

Differential Revision: D21791426

Pulled By: mrshenli

fbshipit-source-id: c3aa8fda1893aa3c0f9ad3db7da25f1ee80303e8
2020-06-01 19:19:46 -07:00
a47e2d4488 [Futures] Allow setErrorIfNeeded arg to have type FutureError (#39113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39113

`setError` is overloaded - it can either take `FutureError` or an error message string as an argument. This PR replicates the same behavior for `setErrorIfNeeded`.
ghstack-source-id: 105038824

Test Plan: Sandcastle/CI

Differential Revision: D21753988

fbshipit-source-id: 0f413afd667f0416400aa95f0b2271b286326ac5
2020-06-01 18:30:34 -07:00
f117089810 Restore thrust path for 1d tensors cumulative ops (#39180)
Summary:
Restores thrust path for computing prefix sums for tensors with a single non-degenerate dimension. Benchmark on P100  before:
```
import time
import torch

l = 4000
t=1000
for _ in range(6):
    for dtype in (torch.half, torch.float, torch.double):
        a = torch.randn(l, device="cuda", dtype=dtype)
        print(f'torch.cumsum(a) a.numel() == {l} for {t} times {dtype}')
        # dry run
        torch.cumsum(a, 0)
        torch.cuda.synchronize()
        # Iterate
        start = time.time()
        for _ in range(t):
            torch.cumsum(a, 0)
        # Final Synchronize Before Teardown
        torch.cuda.synchronize()
        end = time.time()
        elapsed = end - start
        bw = t * l * 2 * a.element_size() * 1e-9/elapsed
        print(f'Time {elapsed} bandwidth {bw}')
    l *= 2
```
```
torch.cumsum(a) a.numel() == 4000 for 1000 times torch.float16
Time 0.29149866104125977 bandwidth 0.05488875984145705
torch.cumsum(a) a.numel() == 4000 for 1000 times torch.float32
Time 0.24511313438415527 bandwidth 0.130551959528402
torch.cumsum(a) a.numel() == 4000 for 1000 times torch.float64
Time 0.25238871574401855 bandwidth 0.25357710550304885
torch.cumsum(a) a.numel() == 8000 for 1000 times torch.float16
Time 0.5812790393829346 bandwidth 0.05505101307965633
torch.cumsum(a) a.numel() == 8000 for 1000 times torch.float32
Time 0.4885847568511963 bandwidth 0.13099057861007293
torch.cumsum(a) a.numel() == 8000 for 1000 times torch.float64
Time 0.5031211376190186 bandwidth 0.2544118909528429
torch.cumsum(a) a.numel() == 16000 for 1000 times torch.float16
Time 1.1607651710510254 bandwidth 0.05513604439220951
torch.cumsum(a) a.numel() == 16000 for 1000 times torch.float32
Time 0.9755356311798096 bandwidth 0.13120996907637011
torch.cumsum(a) a.numel() == 16000 for 1000 times torch.float64
Time 1.0045702457427979 bandwidth 0.25483533987283175
torch.cumsum(a) a.numel() == 32000 for 1000 times torch.float16
Time 2.3198938369750977 bandwidth 0.055174938594129294
torch.cumsum(a) a.numel() == 32000 for 1000 times torch.float32
Time 1.949366569519043 bandwidth 0.13132471029456586
torch.cumsum(a) a.numel() == 32000 for 1000 times torch.float64
Time 2.00749135017395 bandwidth 0.2550446854755488
torch.cumsum(a) a.numel() == 64000 for 1000 times torch.float16
Time 4.63812518119812 bandwidth 0.055194715536735495
torch.cumsum(a) a.numel() == 64000 for 1000 times torch.float32
Time 3.897014856338501 bandwidth 0.13138261435345344
torch.cumsum(a) a.numel() == 64000 for 1000 times torch.float64
Time 4.013219356536865 bandwidth 0.2551567479938705
torch.cumsum(a) a.numel() == 128000 for 1000 times torch.float16
Time 9.274584770202637 bandwidth 0.05520462777427539
torch.cumsum(a) a.numel() == 128000 for 1000 times torch.float32
Time 7.792156934738159 bandwidth 0.1314141910354645
torch.cumsum(a) a.numel() == 128000 for 1000 times torch.float64
Time 8.02474856376648 bandwidth 0.2552104883693396
```
after:
```
torch.cumsum(a) a.numel() == 4000 for 1000 times torch.float16
Time 0.033731937408447266 bandwidth 0.47432792864109924
torch.cumsum(a) a.numel() == 4000 for 1000 times torch.float32
Time 0.031197071075439453 bandwidth 1.025737317539167
torch.cumsum(a) a.numel() == 4000 for 1000 times torch.float64
Time 0.03245425224304199 bandwidth 1.972006611667389
torch.cumsum(a) a.numel() == 8000 for 1000 times torch.float16
Time 0.034340858459472656 bandwidth 0.931834596906329
torch.cumsum(a) a.numel() == 8000 for 1000 times torch.float32
Time 0.031183481216430664 bandwidth 2.0523686741645197
torch.cumsum(a) a.numel() == 8000 for 1000 times torch.float64
Time 0.031975507736206055 bandwidth 4.003063878015136
torch.cumsum(a) a.numel() == 16000 for 1000 times torch.float16
Time 0.032624006271362305 bandwidth 1.9617455767895642
torch.cumsum(a) a.numel() == 16000 for 1000 times torch.float32
Time 0.03129267692565918 bandwidth 4.0904138787514
torch.cumsum(a) a.numel() == 16000 for 1000 times torch.float64
Time 0.03260397911071777 bandwidth 7.851802356107085
torch.cumsum(a) a.numel() == 32000 for 1000 times torch.float16
Time 0.032918691635131836 bandwidth 3.888368390176069
torch.cumsum(a) a.numel() == 32000 for 1000 times torch.float32
Time 0.030851364135742188 bandwidth 8.29785026275116
torch.cumsum(a) a.numel() == 32000 for 1000 times torch.float64
Time 0.037447452545166016 bandwidth 13.6724921243299
torch.cumsum(a) a.numel() == 64000 for 1000 times torch.float16
Time 0.03391098976135254 bandwidth 7.549175114073387
torch.cumsum(a) a.numel() == 64000 for 1000 times torch.float32
Time 0.03214144706726074 bandwidth 15.929587704267457
torch.cumsum(a) a.numel() == 64000 for 1000 times torch.float64
Time 0.034329891204833984 bandwidth 29.828233182859922
torch.cumsum(a) a.numel() == 128000 for 1000 times torch.float16
Time 0.03589606285095215 bandwidth 14.263402705915954
torch.cumsum(a) a.numel() == 128000 for 1000 times torch.float32
Time 0.033178091049194336 bandwidth 30.863740728231736
torch.cumsum(a) a.numel() == 128000 for 1000 times torch.float64
Time 0.03487515449523926 bandwidth 58.72375419238841
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39180

Differential Revision: D21824498

Pulled By: ngimel

fbshipit-source-id: b50fadde598e9ce2871201cd6bb22fa6ac0d482e
2020-06-01 18:07:55 -07:00
e286cb5e81 Revert D21781515: [Onnxifi] Generic way of passing output shape/type hints
Test Plan: revert-hammer

Differential Revision:
D21781515

Original commit changeset: dfae3276e8f1

fbshipit-source-id: 2599056bb8e78791e0bcde01c5251db8d5014857
2020-06-01 17:52:42 -07:00
7b07208d86 [Onnxifi] Generic way of passing output shape/type hints (#39229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39229

Previously we had an ad-hoc way of passing output shape/type hints, which was very limited and didn't support quantized output. We actually have all the shape_info/qshape_info, so we now pass them as TensorProto and QTensorProto directly. This paves the way for setting the output to a quantized type in OnnxifiOp.

Test Plan:
```
buck test glow/fb/test:net_runner
```

Reviewed By: hyuen

Differential Revision: D21781515

fbshipit-source-id: dfae3276e8f158eed830f1244bea6420a9135aab
2020-06-01 17:21:47 -07:00
c5def603a7 Use @skipIfNoFBGEMM instead of direct check (#39068)
Summary:
Should be a no-op, just makes the intent a bit cleaner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39068

Differential Revision: D21829464

Pulled By: malfet

fbshipit-source-id: dc174a3d7da3701bd9d31c366dfa9d24044ef27a
2020-06-01 17:15:36 -07:00
e6d86036e2 Fix return types of Windows API functions in __init__.py (#39334)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39327.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39334

Differential Revision: D21820898

Pulled By: malfet

fbshipit-source-id: ea771f8c44a152cee395ada70f8f129d4ad5283d
2020-06-01 17:03:57 -07:00
295a23f43f [Futures] Added markCompletedIfNeeded API (#39080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39080

This PR adds a function similar to setErrorIfNeeded for marking
futures complete. It only completes futures if they haven't been completed
already.
ghstack-source-id: 105038825

Test Plan: Sandcastle/CI

Differential Revision: D21746065

fbshipit-source-id: a7791a070f19e1f56aa5c2822edc4b60d8227c2c
2020-06-01 16:41:12 -07:00
48e66859c1 Check illegal output dtype for torch.{min, max} (#38850)
Summary:
The test is currently only enabled for CPU, and it will be enabled for CUDA after the migration of `min` and `max` from THC to ATen is done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38850

Differential Revision: D21819388

Pulled By: ngimel

fbshipit-source-id: 406343e96bccbf9139eb1f8f2d49ed530dd83d62
2020-06-01 16:09:39 -07:00
a3c87c4922 Make Optimizer.state_dict() deterministic (#37347)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36831.

Instead of using `id()`, an arbitrary yet consistent order-based index is used. This results in deterministic output between runs.

I am not the biggest fan of using `nonlocal` (it appears to be used sparingly in the codebase) to get `start_index` between calls to `pack_group()`, but the alternatives had larger issues:
- Using the last value added to `param_mappings` would be ideal, but that only works if `dict` iteration order is consistent, and PyTorch currently supports Python <3.7.
- Using the maximum value added to `param_mappings` wouldn't have that issue but would not be constant time.

For testing, I confirmed that `test_optim.py` works before and after these changes. Randomizing the indices in `param_mappings` causes the tests to fail, which is further evidence these changes work. I'm not 100% sure these tests are sufficient, but they're a start.
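A sketch of the order-based indexing idea (not the actual Optimizer code): parameters are keyed by their position in iteration order rather than by the value of `id()`, which changes between runs.
```
import torch

def build_param_mappings(param_groups):
    # Store a position-based index per parameter instead of id(p)
    # (a memory address) itself, so serialized output is identical
    # across runs.
    param_mappings = {}
    start_index = 0
    for group in param_groups:
        params = group['params']
        for i, p in enumerate(params, start_index):
            param_mappings.setdefault(id(p), i)  # first occurrence wins
        start_index += len(params)
    return param_mappings

w = torch.nn.Linear(4, 2)
groups = [{'params': list(w.parameters())}]
print(build_param_mappings(groups))  # weight -> 0, bias -> 1
```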
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37347

Differential Revision: D21353820

Pulled By: vincentqb

fbshipit-source-id: e549f1f154833a461b1f4df6d07ad509aab34ea1
2020-06-01 15:32:02 -07:00
7f1a96d43c Adding sparse Lp regularization operator to Caffe2 (#38574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38574

Adding sparse L1 and L2 regularization operators to Caffe2. This doesn't work using run_on_loss, only run_after_optimize. Applying it after the optimizer rather than on the loss was easier to implement, particularly for the L1 norm, which is preferable in some cases and is non-differentiable at zero.

Test Plan: Wrote and ran unit tests in operator_test:sparse_lp_regularizer_test.

Differential Revision: D21003029

fbshipit-source-id: 81070a621752560ce03e320d065ce27807a5d278
2020-06-01 15:21:19 -07:00
6d3e4aa0f9 Made sure torchscript compiles in optimized mode (#38888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38888

Test Plan: ran the build

Reviewed By: zdevito

Differential Revision: D21045046

fbshipit-source-id: f86d51b083cbc530012d36bbc770f13b28f4c65d
2020-06-01 14:53:55 -07:00
f76e05a2e1 Automated submodule update: FBGEMM (#39322)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 7d673046a6
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39322

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jianyuh

Differential Revision: D21814389

fbshipit-source-id: cec819a28f08915e2443f405d42efaa41a523bc8
2020-06-01 13:14:26 -07:00
cffa0bee04 Don't generate DeviceGuard for CPU wrapping code. (#38806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38806

I'm trying to delete the Type wrapper code entirely, and am
figuring out exactly how many device guards need to be
preserved. For now, delete the guards that are known to be
useless.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21764403

Pulled By: ezyang

fbshipit-source-id: 9c3d18f209339dfe2adbe5866b31b03b55990b74
2020-06-01 13:10:57 -07:00
2b6a48e962 Remove supports_named_tensor from codegen entirely. (#38739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38739

Instead of codegenning the named tensor support checks into
CPUType/CUDAType, we instead add a new dispatch key that is put
into tensor whenever it has names.  By default, the fallback
implementation says that named tensors are not supported, but
if they are supported, we register a fallthrough which lets
us through to the true backend implementation.

There are a bunch of small pieces which are necessary to make this
happen:

- NameMode now also excludes DispatchKey::Named from the dispatch set
- To avoid bad error messages, we add a teensy special case to
  the dispatcher for named_not_supported_kernel: if we see that
  the boxed kernel we need to invoke from unboxed is this kernel,
  but we don't support boxing, and it's a kernel which is known
  not to need boxing, we just pass in nullptr for the stack.
  The special case here is very nice: it doesn't affect the fast
  path and only gets exercised when things are not supported.
- I need to add support for per operator fallthrough registration.
  This is done similarly to how we support fallthrough fallback,
  by just keeping track of whether the registered kernel for an operator
  is a fallthrough.

It is possible we could go even further down this path, and move
the named tensor logic itself into this key.  I leave this
up to future work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21662643

Pulled By: ezyang

fbshipit-source-id: 5bc6ae14a1f600189bd8bf865f74dd1700d932f7
2020-06-01 13:09:08 -07:00
45baf0e1a0 [Profiler x RPC] Enable RPC Server Global Profiler (#38847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38847

See motivation and design in https://github.com/pytorch/pytorch/issues/38845.

Close https://github.com/pytorch/pytorch/issues/38845.

Changes,

- Add pre-request and post-response hooks to RPC "request_callback_impl.cpp". For a thread that executes RPC handler, check if the server-side global profiling is on. If it's on, enable profiling on this thread and after response, merge the thread-local profiling result into the global profiling state.
- Add context-style Python API to parse the profiling Events into ranges represented by FunctionEvent.
- Add data-structures to work as global profiling state that support nesting and container for consolidating results from multiple threads.

Test,

- Add a test that uses nested profiling range and inspect the profiling events.

ghstack-source-id: 104991517

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_server_process_global_profiler

Differential Revision: D5665992

fbshipit-source-id: 07f3bef5efd33d1214ef3404284c3803f5deca26
2020-06-01 12:35:52 -07:00
bdaa78499e Reland Refactor c10::complex and cleanup c10::Scalar (#39306)
Summary:
This reverts commit 8556664d6896a8e7f48f1c419e06e0568b9ee09e.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39306

Differential Revision: D21818096

Pulled By: albanD

fbshipit-source-id: ed4396fcad8c7036fb7bfa2f3da6ed63c0eb6625
2020-06-01 11:51:57 -07:00
39d037253c Test PyTorch using python-3.8 + GCC-9 on Bionic (Reland) (#39121)
Summary:
Enable new test config in .circleci/config.yml
Skip scanning several 3rd-party packages to work around https://bugs.python.org/issue40350
Remove pre python-3.5 checks from `test.sh` and update `scikit-learn` to python-3.8 compatible version

This is a reland of https://github.com/pytorch/pytorch/pull/39030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39121

Differential Revision: D21820375

Pulled By: malfet

fbshipit-source-id: d0be79b7d204cf692e055d42b9be42402dc4c1c0
2020-06-01 11:11:12 -07:00
78244f8129 Kill CPPTypeToScalarType. It's now subsumed by CPPTypeAndStdComplexToScalarType. (#39263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39263

CPPTypeToScalarType is confusing because it doesn't handle the different complex types and it maps everything that it doesn't know about to Undefined, which is error prone.

Test Plan: Imported from OSS

Differential Revision: D21790515

Pulled By: gchanan

fbshipit-source-id: ec897fd50bd8f7548a34573e59eb57bf3c6383c6
2020-06-01 11:00:58 -07:00
ddf6d49445 Avoid defining bogus CPPTypeAndStdComplexToScalarType<void> by using some decltype tricks. (#39261)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39261

Test Plan: Imported from OSS

Differential Revision: D21790288

Pulled By: gchanan

fbshipit-source-id: c1f04577c02f78dbc911aad4cb1d862acbea4b31
2020-06-01 10:58:51 -07:00
07518e120b [nvFuser] add torch.jit.fuser context manager (#38993)
Summary:
1. The `torch.jit.fuser(str)` context manager facilitates switching between backend fusers (see the sketch below):
  - 'fuser0' enables only the legacy fuser;
  - 'fuser1' enables only NNC;
  - 'fuser2' enables only nvFuser.
2. Cleaned up and updated the Python tests.
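
A minimal sketch of the context manager in use:
```
import torch

@torch.jit.script
def f(x):
    return x.relu() * 2.0

x = torch.randn(8)
with torch.jit.fuser("fuser1"):  # only NNC active inside this block
    f(x)
    f(x)  # fusion typically kicks in after profiling runs
```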
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38993

Reviewed By: nairbv, pbelevich

Differential Revision: D21800620

Pulled By: soumith

fbshipit-source-id: 7fe855f5a5b97368e5e84c98c28d04b2e1276c85
2020-06-01 10:52:40 -07:00
42b2dee6c2 verbose unused in torch.backends.cudnn (#39228)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39228

Differential Revision: D21818455

Pulled By: ezyang

fbshipit-source-id: abf158f2d745fd135cd0966ee30d559cefa456c0
2020-06-01 09:08:03 -07:00
c193bd41f5 fake_quantize: respect device affinity (#39031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39031

Makes the eager-mode QAT prepare logic respect device affinity.
This fixes the issue where, for a module on `cuda:0`, running
the QAT prepare script would add observers on `cpu`. Now it
will add them on the original device.
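
A minimal sketch of the fixed behavior, assuming a CUDA device is available:
```
import torch
import torch.quantization as tq

model = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 1), torch.nn.ReLU()).cuda()
model.train()
model.qconfig = tq.get_default_qat_qconfig('fbgemm')
tq.prepare_qat(model, inplace=True)

# With this fix, the inserted fake-quant/observer modules live on
# cuda:0 alongside the original module, not on cpu.
print(next(model.parameters()).device)  # cuda:0
```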

Test Plan:
```
python test/test_quantization.py TestDistributed.test_device_affinity
```

Imported from OSS

Differential Revision: D21729272

fbshipit-source-id: 5537bf3977ddc23412184941978bf0d1cc6fb479
2020-06-01 08:55:14 -07:00
2fe0fc2684 Revert D21374247: Use TensorMethods.cpp
Test Plan: revert-hammer

Differential Revision:
D21374247

Original commit changeset: 076964415079

fbshipit-source-id: 732ec8c561d1f37475c1b5549ba79c718e3a6db8
2020-06-01 08:12:09 -07:00
1943a2c317 Fix missing code in 'Installing C++ distribution of Pytorch' (#39237)
Summary:
Fix https://github.com/pytorch/pytorch/issues/39236

- Before:

![image](https://user-images.githubusercontent.com/6421097/83250998-8e0e5580-a16e-11ea-863e-ed4d9e060bdf.png)

- After:

![image](https://user-images.githubusercontent.com/6421097/83250933-73d47780-a16e-11ea-86d3-c5a77d9fa6d1.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39237

Differential Revision: D21818392

Pulled By: ezyang

fbshipit-source-id: d7e51de83ec84276e88cbf168bf9e7f57200ff46
2020-06-01 07:54:43 -07:00
7773a45c0d Division by zero crashes for fmod operator(#32699) (#38919)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38919

Differential Revision: D21791648

Pulled By: anjali411

fbshipit-source-id: 447ded74fa52377b04c1b2271a0b3eb5b8e4eeed
2020-06-01 07:48:52 -07:00
dc4fd0409f DOC: remove java documentation (#38920)
Summary:
Continuation of issue gh-36064 and PR gh-38042, which removed the unmaintained javasphinx extension. The unknown Sphinx directives cause warnings when building documentation.

Edit: link to PR as well as issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38920

Differential Revision: D21818297

Pulled By: ezyang

fbshipit-source-id: 2c1d007a7689b26653d7dee081b0b969b8a731a2
2020-06-01 07:32:00 -07:00
a566451017 Migrate AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2 to c10::complex (#39285)
Summary:
All the uses of `AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2` are for CUDA.

Dispatch macro comes first, cleanup of remaining `c10::complex --> thrust::complex` will be done later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39285

Differential Revision: D21803978

Pulled By: anjali411

fbshipit-source-id: ec9837f121e3020dfa2d12c8bc9aede9fb01c375
2020-06-01 07:25:47 -07:00
aa5afbdb92 Add dynamic_cast asserts to CPU Loops. (#39258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39258

On CUDA, we currently support casting loops dynamically (i.e. when the argument or return types of the lamba don't match the dtypes of the TensorIterator).
On CPU, before this change we would essentially reinterpret_cast; now we raise an internal assert. We could add dynamic_casting support on CPU in the future.

Test Plan: Imported from OSS

Differential Revision: D21790020

Pulled By: gchanan

fbshipit-source-id: b52f4340a0553f0c1bd8fafaa58309bc110adecf
2020-06-01 07:23:51 -07:00
9b05b1bacf Make needs_dynamic_casting multiple-complex-type aware. (#39255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39255

We don't actually cast between these complex representations, but the prior implementation would indicate that we needed to dynamic_cast,
because we didn't have mappings for std::complex or thrust::complex.

This PR makes it so they all map to the same dtype.

Note that this has no functional change as all the use sites have already been changed to take this into account.

Test Plan: Imported from OSS

Differential Revision: D21789694

Pulled By: gchanan

fbshipit-source-id: 6127aab32c40e62bf1b60fe5ccaeffacc60e3b52
2020-06-01 07:23:46 -07:00
f9edbda7d7 Loops: Separate out dynamic_casting concerns from complex overloads. (#39254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39254

dynamic_casting is meant to handle CUDA kernels when the operand dtypes don't match the C++ kernel function types.
This is made more complicated by the current state of complex, which uses thrust::complex, std::complex, c10::complex.

Currently, thrust::complex and std::complex are flagged as needing dynamic casting even though we don't actually cast them.
But, making them not need dynamic_cast doesn't work either because certain dynamic_casting optimizations don't work with thrust::complex and (maybe) std::complex.

So, we separate out these concerns so we can iterate on dynamic_casting checks, in particular by applying them to CPU.

This PR should have no functional change.

Test Plan: Imported from OSS

Differential Revision: D21788870

Pulled By: gchanan

fbshipit-source-id: 5d69c9851423dee2fbe789674f4306710378f4ff
2020-06-01 07:22:09 -07:00
caaea084e9 [caffe2] minor typo fix in fused_rowwise_nbitfake_conversion_ops.h comment (#39315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39315

As title

Test Plan: Just comment change

Reviewed By: jianyuh

Differential Revision: D21813196

fbshipit-source-id: 3ff6bcd3cc31a4820bf7c7a948123c9e968f5de2
2020-05-31 23:32:39 -07:00
5153cdbe87 [TensorExpr] fix a bug in ReorderAxis when there are trailing loops (#38841)
Summary:
Fixes a bug in ReorderAxis where we appended the new reordered loops to the end of the enclosing block, even if there were statements after the original position. E.g. with 3 Computes:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m2 ...
  for (int n2 ...
    for (int k2 ...
      Body 2
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
```

If we reorder loops m2 and k2, we were also reordering the body statements like this:

```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
for (int k2 ...
  for (int n2 ...
    for (int m2 ...
      Body 2
```

This is because we always append the new loops to their parent. This PR fixes the logic to replace the old loop root with the new loop, which keeps things consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38841

Differential Revision: D21723670

Pulled By: nickgg

fbshipit-source-id: 1dee8bb153182fcaa2cabd948197577e8e80acd7
2020-05-31 22:22:45 -07:00
68e62b9ab6 Use TensorMethods.cpp (#37639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37639

Change TensorMethods.h to .cpp.
This is necessary to avoid incomplete types in the dispatcher.

Test Plan:
CI

Imported from OSS

Checked mobile size: no change overall, and a small reduction in size in fbios.
fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -18.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -8.8 KiB

Reran the benchmark; no statistically significant difference.

buck run mode/opt caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:benchmark_torchscript_model -- --model_file caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt --num_runs 3

╷ @  68592d0d  41 minutes ago  iliacher  D21374247
╭─╯  Use TensorMethods.cpp

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1729113760

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/3867976782

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2782186766

hg prev

@  7f501b42  Thursday at 14:26  bvaughan  D21764704
╷  short-circuit pow for complex 1 and 0 exponents

Created 3 benchmark runs on aibench for caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads/addmodule.pt.
Links to the results:

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/2155256332

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/1802057074

* Adhoc run: https://our.intern.facebook.com/intern/aibench/details/4119590830

Differential Revision: D21374247

fbshipit-source-id: 076964415079cf84fb57f1f7b43d087afed86e1d
2020-05-31 17:11:12 -07:00
f872cf5ed0 Add %= support in TorchScript (#38983)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38336

Add %= support in TorchScript. It's now possible to do something like:
```py
@torch.jit.script
def mm(a, b):
    a %= b
    return a
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38983

Differential Revision: D21803523

Pulled By: SplitInfinity

fbshipit-source-id: 3437860d06d32e26ca9a5497099148c1f1616c5b
2020-05-31 12:51:56 -07:00
8556664d68 Revert D21769463: [pytorch][PR] Refactor c10::complex and cleanup c10::Scalar
Test Plan: revert-hammer

Differential Revision:
D21769463

Original commit changeset: 3cb5bcbb0ff3

fbshipit-source-id: 0392e23d7057f90e7b13c9abf19bcca2d84b26fa
2020-05-30 18:02:51 -07:00
928ce29ee2 Refactor c10::complex and cleanup c10::Scalar (#38593)
Summary:
**Main:**
- `c10::complex` is refactored: it no longer uses inheritance to specialize constructors, but uses SFINAE instead. This implementation is cleaner and avoids some compiler bugs.
- `c10::Scalar` is cleaned up: it no longer needs to store complex as `double z[2]`, `c10::complex<double>` will work.

**Other cleanups:**
- `numeric_limits` of `c10::complex` is moved to `complex_utils.h`
- the variable in `c10::complex` storing real and imag is changed from `storage[2]` to `real_` and `imag_`
- remove the `c10::` before `complex` when in `c10` namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38593

Differential Revision: D21769463

Pulled By: anjali411

fbshipit-source-id: 3cb5bcbb0ff304d137221e00fe481a08dba7bc12
2020-05-30 13:33:51 -07:00
fcef43965b [AMD] Fix broken test (#39297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39297

The histogram op doesn't have a GPU implementation, which was breaking the CI GPU test. Make the test run on CPU only.

Test Plan: CI

Reviewed By: hwangjeff

Differential Revision: D21800824

fbshipit-source-id: 9c835786f22bac7d420ce610397a6ee69084c19a
2020-05-30 13:12:24 -07:00
b7b99ab0c8 [ONNX] Remove Aten ops from ONNX export (#37239)
Summary:
This PR adds a new operator export type to the exporter: ONNX_FALLTHROUGH.
This new type allows ops that are not supported to pass through.
This PR also removes all ATen ops from the ONNX operator export type mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37239

Reviewed By: hl475

Differential Revision: D21440509

Pulled By: houseroad

fbshipit-source-id: 38b826677cf3431ea44868efebefe1ff51c9aa75
2020-05-29 21:20:14 -07:00
c02cb7aa08 [nnpi fake ops] bug fix int8QuantizeNNPI (#39271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39271

The bug caused a 10% NE loss. It was in the emulation itself; NNPI is fine.

Test Plan: mobile_cvr has no NE loss after this fix: https://fburl.com/mlhub/z6hd8rhn

Reviewed By: hyuen

Differential Revision: D21793205

fbshipit-source-id: a908e95c26c2353f982d05e0a20f02f3c724715d
2020-05-29 20:49:07 -07:00
b0d6e4b604 work around building onnx in older rocm docker images (#39253)
Summary:
CC ezyang xw285cornell sunway513 malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39253

Differential Revision: D21799868

Pulled By: xw285cornell

fbshipit-source-id: 3ced799c0a3a3f1e052b362e8333dda2f76aeecd
2020-05-29 19:16:08 -07:00
25a6c5f60f [quant] Dynamic Linear module to use reduce_range (#39125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39125

Switch to setting reduce_range to true for version > 3.
Models serialized with an older state_dict will have version <= 3, so they will be run with reduce_range=false.

Verified with backward compatibility tests (works with no changes to these tests)

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21769689

fbshipit-source-id: 131f2ae736e31705222e82bdc77480f2f1826fe8
2020-05-29 18:21:57 -07:00
9cacbe29e5 [quant] Add reduce_range argument for qlinear_dynamic (#39041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39041

The reduce_range option restricts the activation tensor to 7 bits instead of 8 (see the numeric sketch below).
This is necessary to enable per channel quant for RNNs and LSTMs
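
A hedged illustration of what the 7-bit restriction means numerically (these are the standard signed 8-bit ranges; not code from this PR):
```py
# Full 8-bit signed quantized range vs. the reduced 7-bit range
# used for activations when reduce_range is enabled.
full_range = (-128, 127)   # qint8, 8 bits
reduced_range = (-64, 63)  # same dtype, restricted to 7 bits
print(full_range, reduced_range)
```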

Test Plan:
python test/test_quantization.py TestDynamicQuantizedLinear

Imported from OSS

Reviewed By: akinh

Differential Revision: D21769691

fbshipit-source-id: ef0e9873367f3c1b34091b0b3af788233ef60c6c
2020-05-29 18:19:36 -07:00
001102c50c Avoid a TensorIterator/Loops reinterpret_cast in a test. (#39246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39246

This was found by adding some error checking in https://github.com/pytorch/pytorch/pull/38817, but that needs more work to be able to merge, so we just do a one-off fix here.

Test Plan: Imported from OSS

Differential Revision: D21786761

Pulled By: gchanan

fbshipit-source-id: e4ecf6506c8649214d0fddfcca2ada6afa339d3b
2020-05-29 16:21:33 -07:00
7ab96461d0 Remove some unnecessary code in Onnxifi (#39197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39197

Pull Request resolved: https://github.com/pytorch/glow/pull/4552

Clean up some code to prepare support for quantized output in Onnxifi.

Reviewed By: jfix71

Differential Revision: D21770855

fbshipit-source-id: 8ecfd675846e3a42a80fd133e5eaa8dad0445bd3
2020-05-29 15:54:59 -07:00
a5e023f28a Set RecordFunction id only when needed (#39265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39265

In this PR we set the id of a RecordFunction only when callbacks need it and when
there's at least one active callback.

Test Plan:
testRecordFunction unit test in test_misc.cpp
buck test mode/dev caffe2/test/cpp/jit:jit

https://our.intern.facebook.com/intern/testinfra/testrun/8725724291116413

Reviewed By: dzhulgakov

Differential Revision: D21790421

fbshipit-source-id: 016623d7f1a2a271921a71c0483061e232b40321
2020-05-29 15:34:44 -07:00
1c67c3d587 test_fc_nnpi_fp16.py test_fc_num0_fix fix (#39248)
Summary:
Update the test_fc_num0_fix test case to limit max_examples to 5.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39248

Test Plan: test_fc_num0_fix

Reviewed By: amylittleyang

Differential Revision: D21787870

Pulled By: yinghai

fbshipit-source-id: 9db85c44e8d0e5492b5862d2716b3baf55a466df
2020-05-29 15:07:07 -07:00
1d0ec50a02 [quant][graphmode] Rename _quantize_script.py to quantize_script.py (#39122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39122

Test Plan: Imported from OSS

Differential Revision: D21757619

fbshipit-source-id: 603c020aaaf6f467e63f15b4f271fe946d9fb949
2020-05-29 12:33:40 -07:00
a50d781c03 Added real and imag views as tensor attributes (#39033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39033

Added `real` and `imag` views as tensor attributes. Right now, tensor.imag is disabled for real tensors. This is because if we returned a new tensor of zeros, the user would be able to update the tensor returned by tensor.imag, which should not be allowed: NumPy returns a read-only array here, and PyTorch doesn't support read-only tensors yet. A usage sketch follows after the TODO list.

TODO in follow-up PRs:
1. add a setter for `real` and `imag`
2. add special case in codegen for `real` and `imag` backward functions.
3. remove `copy_real` and `copy_imag` methods.
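
A minimal usage sketch of the new attributes (hedged; behavior as described above):
```py
import torch

z = torch.tensor([1 + 2j, 3 - 4j])
print(z.real)  # tensor([1., 3.])  -- a view onto the real components
print(z.imag)  # tensor([ 2., -4.]) -- a view onto the imaginary components

r = torch.tensor([1.0, 2.0])
# r.imag would raise here: as described above, imag is disabled
# for real tensors until read-only tensors are supported.
```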

Test Plan: Imported from OSS

Differential Revision: D21767542

Pulled By: anjali411

fbshipit-source-id: 539febf01f01ff055e3fbc7e9ff01fd3fe729056
2020-05-29 12:31:51 -07:00
c3d3782c80 Fix init-shutdown race condition in autograd engine (#39194)
Summary:
If the Engine is created shortly before the application exits, a non-reentrant thread might not have a chance to spawn, which would result in an infinite wait in `Engine::~Engine()`.
Prevent this by actually waiting for the threads to spawn before returning from `Engine::start_device_threads()`.
Also make sure that the thread count is incremented before the GIL is acquired in PythonThread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39194

Differential Revision: D21789219

Pulled By: malfet

fbshipit-source-id: d9b5e74d5ddeb2474b575af2e4f33d022efcfe53
2020-05-29 12:20:31 -07:00
88c5fd94e7 [nnpi eval] enable int8 eval with emulation Int8FC (#39112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39112

Allow int8 packed weights in the int8 model to deserialize to the original format, and set the default deserialization behavior in eval workflows to the original format.

Test Plan: Tested with workflow: f192797187

Reviewed By: yinghai

Differential Revision: D21737940

fbshipit-source-id: 7afaf307b16cb4e85e61f019356f83fdab772c57
2020-05-29 11:59:12 -07:00
29c04acdbb Followup for cuda assert cleanups (#39220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39220

Differential Revision: D21786485

Pulled By: malfet

fbshipit-source-id: 06d11519709a648f096907b733d97f643633171b
2020-05-29 11:53:46 -07:00
a5d44800f0 Implement CUDA_KERNEL_ASSERT for MSVC (#39218)
Summary:
Tested locally on CPU/GPU + Debug/Release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39218

Differential Revision: D21786500

Pulled By: malfet

fbshipit-source-id: 7e871003d3509436952932b5ff3599e36bb8f205
2020-05-29 11:44:54 -07:00
c25b3d4305 [ROCm] in test_cuda.py, re-enable skipped tests (#37952)
Summary:
- test_stream_context
- test_cublas_multiple_threads_same_device
- test_cusparse_multiple_threads_same_device

These tests passed three rounds of CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37952

Differential Revision: D21532027

Pulled By: vincentqb

fbshipit-source-id: dce7fc4f0943e2be43da71e213e168c455c66751
2020-05-29 11:38:47 -07:00
85d0292c14 [quant][graphmode] Cleanup inplace API (#38827)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38827

Test Plan: Imported from OSS

Differential Revision: D21673481

fbshipit-source-id: becca38efcf720089407c981419b33f629a33e91
2020-05-29 11:13:25 -07:00
7836eaceee [JIT] JIT should let people know we inferred an argument as a tensor (#38527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38527

This PR solves issue (#37200).
The error is encountered during IR generation while trying to resolve the call to sum.
The message should let the user know that it inferred the value for argument 'dim' to be of type 'Tensor'
because it was not annotated with an explicit type.

Test Plan:
Add code to reproduce the issue (#37200)
`python test/test_jit.py TestJit.test_inferred_as_tensor`

Differential Revision: D21743876

Pulled By: superwizard2019

fbshipit-source-id: 370ca32afea4d53b44d454f650f7d3006f86bcc6
2020-05-29 10:41:50 -07:00
86f46ac9ca Fix assertNotEqual error reporting (#39217)
Summary:
The `msg` argument must be passed to `assertRaises`, because its exception (carrying the custom error message) is raised if `assertEqual` unexpectedly succeeds; see the sketch below.
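
A hedged sketch of the pattern (not the actual test-suite code; names are illustrative):
```py
import unittest

class Example(unittest.TestCase):
    def assertNotEqualExample(self, x, y, msg=None):
        # Pass msg through to assertRaises so the custom message
        # surfaces when the inner assertEqual unexpectedly succeeds.
        with self.assertRaises(AssertionError, msg=msg):
            self.assertEqual(x, y)

    def test_demo(self):
        self.assertNotEqualExample(1, 2, msg="values should differ")

if __name__ == "__main__":
    unittest.main()
```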
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39217

Differential Revision: D21786141

Pulled By: malfet

fbshipit-source-id: f8c3d4f30f474fe269e50252a06eade76d575a68
2020-05-29 10:35:56 -07:00
f44fca882e Update NNPI backend to v0.6.0.5 (#4539)
Summary:
Updating to NNPI Backend version v0.6.0.5
Pull Request resolved: https://github.com/pytorch/glow/pull/4539

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

**Glow Dev Mode Testing ICEREF**
```
buck test //glow: -- NNPI
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/844425092762063/

**Glow Opt Mode Testing ICEREF**
```
buck test mode/opt //glow: -- NNPI
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/8444249313953808/

**Glow Opt Mode Testing On Device**
```
buck test mode/opt -c glow.nnpi_use_inf_api=true //glow: -- NNPI -j 1
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5629499560791922/

**Net_runner defaults OPT ICEREF**
```
buck test mode/opt //glow/fb/test:net_runner_nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1970324865910264/

**Net_runner defaults OPT CARD**
```
USE_INF_API=1 buck test mode/opt //glow/fb/test:net_runner_nnpi
```
FAIL https://www.internalfb.com/intern/testinfra/testconsole/testrun/6192449501601838/

**Net_runner tiny ctr_mbl_feed_2020q1 OPT CARD**
```
USE_INF_API=1 LD_LIBRARY_PATH=third-party-buck/platform007/build/fb-nnpi/lib ./buck-out/opt/gen/glow/fb/test/net_runner_nnpi --logfiledb  ~/test/161676462_0.predictor --opt_net ~/test/debug_optimized_net_0.pb_txt --use_input ~/test/inputs.pb.recordio --glow-nnpi-memory=13000000 --glow-num-devices=2 --ref_impl=glow --test_impls=glow,c2_fp16 --caffe2_fbgemm_fake_fp16_clamp --glow_global_fp16 --glow_clip_fp16 --glow_global_fused_scale_offset_fp16 --fbgemm_deserialize_to_original_format --inference_threads 16 --load_model_by_blob --glow_global_fp16_placeholders --glow_global_fp16_constants --glow_clip_fp16_skip_inputs --glow_nnpi_lower_all_batch_matmul=false --glow_nnpi_num_parallel_chunks=6 --print_latency
```
success

**Net_runner FP16 OPT CARD**
```
USE_INF_API=1 LD_LIBRARY_PATH=third-party-buck/platform007/build/fb-nnpi/lib ./buck-out/opt/gen/glow/fb/test/net_runner_nnpi  --opt_net buck-out/opt/gen/glow/fb/test/instagram_ctr_model_debug_optimized_net/debug_optimized_net_0.pb_txt --logfiledb buck-out/opt/gen/glow/fb/test/instagram_ctr_model_tiny/105533872_0.predictor --use_input buck-out/opt/gen/glow/fb/test/instagram_ctr_model_inputs_pb/inputs.pb --inference_threads=1 --glow_global_fp16 --glow_global_fused_scale_offset_fp16=1 --glow_clip_fp16
```
FAIL P131738351

**numerics tests on IceRef**
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_batchmatmul_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688849890354625/
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_batchnorm_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1125900069449763/
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_fc_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5910974535833584/
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_op_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5910974535833516/
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_int8_ops_nnpinnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/2533274820486771/
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_sls_4bit_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5910974536132077/
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_sls_8bit_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/4222124677262742/
```
buck test //caffe2/caffe2/contrib/fakelowp/test:test_sls_8bit_nnpi_fp32nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688849890355133/

Previously disabled tests `test_slws_fused_8bit_rowwise_acc32_nnpi` and `test_small_sls_acc32` still FAIL
https://www.internalfb.com/intern/testinfra/testconsole/testrun/7599824383821851/

**numerics tests on Card**
```
buck test -c glow.nnpi_use_inf_api=true //caffe2/caffe2/contrib/fakelowp/test:test_batchmatmul_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/4222124677263984/
```
buck test -c glow.nnpi_use_inf_api=true  //caffe2/caffe2/contrib/fakelowp/test:test_batchnorm_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1125900069450391/
```
buck test -c glow.nnpi_use_inf_api=true //caffe2/caffe2/contrib/fakelowp/test:test_fc_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688849890354230/

```
buck test -c glow.nnpi_use_inf_api=true //caffe2/caffe2/contrib/fakelowp/test:test_op_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5910974535833691/

```
buck test -c glow.nnpi_use_inf_api=true //caffe2/caffe2/contrib/fakelowp/test:test_int8_ops_nnpinnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/3659174723845865/

```
buck test -c glow.nnpi_use_inf_api=true //caffe2/caffe2/contrib/fakelowp/test:test_sls_4bit_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5910974536135102/

```
buck test -c glow.nnpi_use_inf_api=true //caffe2/caffe2/contrib/fakelowp/test:test_sls_8bit_nnpi_fp16nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/2251799842675543/

```
buck test -c glow.nnpi_use_inf_api=true //caffe2/caffe2/contrib/fakelowp/test:test_sls_8bit_nnpi_fp32nnpi
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/2533274820487658/

Previously disabled tests `test_slws_fused_8bit_rowwise_acc32_nnpi` and `test_small_sls_acc32` still FAIL
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1407375046178756/

Reviewed By: arunm-git

Differential Revision: D21697616

Pulled By: hl475

fbshipit-source-id: 3732324986eb40e644686cdd10e44951678508a7
2020-05-29 10:30:45 -07:00
b44f02f8f5 Fix windows upload jobs (#39249)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39247.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39249

Differential Revision: D21788050

Pulled By: seemethere

fbshipit-source-id: 05355364ac063000bd3023e9af50bb6af39b639e
2020-05-29 09:57:36 -07:00
10e2126b10 support complex types for cumsum, cumprod (#39063)
Summary:
Adds complex support to `cumsum` and `cumprod`, with a relevant test update in `test_torch::tensor_op_tests`; a small usage sketch follows.
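```py
import torch

# Hedged illustration: cumulative ops now accept complex inputs.
z = torch.tensor([1 + 1j, 2 - 1j, 0.5j])
print(torch.cumsum(z, dim=0))   # running complex sums
print(torch.cumprod(z, dim=0))  # running complex products
```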
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39063

Differential Revision: D21771186

Pulled By: anjali411

fbshipit-source-id: 632916d4bdbd1c0941001898ab8146be2b7884fc
2020-05-29 09:36:26 -07:00
4b5e87f94a Revert D21751663: [pytorch][PR] Fix argmin/max bug
Test Plan: revert-hammer

Differential Revision:
D21751663

Original commit changeset: 6d55e4bb7834

fbshipit-source-id: 5473af5650b8a14f1da32d660be43ccf027513e1
2020-05-29 09:08:46 -07:00
d6715e6364 Improve warnings to actually point at user code (#39143)
Summary:
These warnings' goal is to show the user where to be careful in their code, so make them point at the user's code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39143

Differential Revision: D21764201

Pulled By: albanD

fbshipit-source-id: f1369d1b0e71d93af892ad3b7b1b3030e6699c59
2020-05-29 06:45:24 -07:00
d1212e5814 [TensorPipe] Use PrefixStore to avoid conflicting keys (#39185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39185

The TP agent used the store for two things: mapping ranks to names, and mapping names to addresses. The former was prefixed, the latter wasn't. So, if a worker had a name which was `names/0` this would lead to a conflict. We should prefix both usages, and we can do so easily with the `PrefixStore`.
ghstack-source-id: 104837023

Test Plan: Unit tests

Differential Revision: D21767862

fbshipit-source-id: a256c0b9be349c7ffc11ac2790a2a682e3af84d5
2020-05-29 03:36:18 -07:00
99f6df3c07 [TensorPipe] Bind to hostname's IP address instead of localhost (#39184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39184

TensorPipe has implemented some helpers to resolve the IP address of the hostname and to retrieve the IP address of a given interface using libuv, which means they are supposed to be portable across Linux, Mac, Windows... We can thus replace the version we had implemented inside the agent itself (which only resolved the hostname) with those helpers.
ghstack-source-id: 104837026

Test Plan: Unit tests

Differential Revision: D21494693

fbshipit-source-id: 4652dde6f7af3a90e15918506a103408f81ced0b
2020-05-29 03:36:13 -07:00
3ac0ec3dab [TensorPipe] Don't use separate heap allocation for metrics (#39183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39183

I didn't see any reason behind it, and it seems to work even after removing the unique_ptrs. (Well, it compiles...)
ghstack-source-id: 104837027

Test Plan: None...

Differential Revision: D21767863

fbshipit-source-id: daebfae69d5b63f1d10345abd625b7e0ddce7e6d
2020-05-29 03:36:07 -07:00
587d453b0f [TensorPipe] Ignore expected errors (#39182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39182

When the TensorPipe context is closed and joined, all pending callbacks are invoked with an error of type PipeClosedError. This is normal and expected, and should not be logged.

There is still one log that should be addressed, namely when an incoming pipe from a remote worker dies after we have joined. That will require some type of "signal" from the remote worker that the shutdown is intentional, for example sending an empty packet?
ghstack-source-id: 104837024

Test Plan: Logs become less spammy.

Differential Revision: D21703036

fbshipit-source-id: 0a2f9985032b9f1aaf7d2b129ce6d577f13062a4
2020-05-29 03:34:45 -07:00
debb7ba6f4 Update TensorPipe submodule (#39189)
Summary:
Pick up a fix to SHM, which was crashing when writing to a full reactor ringbuffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39189

Test Plan: Testing by CI.

Reviewed By: mrshenli

Differential Revision: D21769275

fbshipit-source-id: 1499f028d85de3a2facc79277ac5bdea73fd15cc
2020-05-29 02:02:12 -07:00
fce01a9bab [JIT] Make new zip serialization for torch save/load significantly (~70%) faster (#38379)
Summary:
Before:
```
2020-05-11 18:31:41 INFO     Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
  "Big Tensors Save": {
    "mean": 17.8048762,
    "median": 17.458917
  },
  "Big Tensors Load": {
    "mean": 3.2556887,
    "median": 2.9668495000000004
  },
  "Small Tensors Save": {
    "mean": 4.0381357,
    "median": 3.9440125
  },
  "Small Tensors Load": {
    "mean": 5.8792499,
    "median": 5.603067
  },
  "benchmark_run_at": "2020-05-12T01:31:41"
}
```
After
```
Use zipfile serialization: True
2020-05-12 20:15:32 INFO     Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
  "Big Tensors Save": {
    "mean": 4.7534657,
    "median": 4.646732
  },
  "Big Tensors Load": {
    "mean": 3.6001919,
    "median": 3.493285
  },
  "Small Tensors Save": {
    "mean": 4.1066924,
    "median": 4.1219255
  },
  "Small Tensors Load": {
    "mean": 6.3902358,
    "median": 6.36977
  },
  "benchmark_run_at": "2020-05-13T03:15:32"
}
```
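
For reference, a hedged sketch of opting into the zip-based format explicitly (the keyword is the private flag from this era's API, which later became the default):
```py
import torch

tensors = {"weight": torch.randn(1000, 1000)}
# Opt into the zip-based container explicitly:
torch.save(tensors, "checkpoint.pt", _use_new_zipfile_serialization=True)
loaded = torch.load("checkpoint.pt")
```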
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38379

Differential Revision: D21779494

Pulled By: voznesenskym

fbshipit-source-id: 694d65029a5b817424d454bd331e285df828c67a
2020-05-29 01:56:18 -07:00
b08a4aaf3b [PyTorch] Fix operator perf observer index issue.
Summary: Fix operator perf observer index issue.

Test Plan:
Made sure that the operator index is populated correctly and ran benchmarking for pytext_mobile_inference; see the result:
https://www.internalfb.com/intern/aibench/details/598900068317693

Reviewed By: linbinyu

Differential Revision: D21779222

fbshipit-source-id: 0fc3561d83d10cfabd73e1e6b6ee240ce0bafd80
2020-05-28 21:52:24 -07:00
d0650af2fb Change __CUDACC__ and __HIPCC__ to __CUDA_ARCH__ and __HIP_ARCH__ in NumericUtils.h (#39213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39213

This PR fixes the problem that [__expf/__logf/__tanf](https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html) are "intrinsic functions that are only supported in device code", so nvcc doesn't recognize them when compiling host code. Hence `__CUDACC__` should be replaced with `__CUDA_ARCH__`.

Test Plan: Imported from OSS

Differential Revision: D21779132

Pulled By: pbelevich

fbshipit-source-id: b326e2135525b6a1f2392f8d1c17b735d8ef431a
2020-05-28 21:33:31 -07:00
2331853236 [caffe2] Fix the correctness check for GivenTensorFill operator
Summary:
The DCHECK is never triggered, and the user error could lead to a crash.

I could make the error message even nicer by checking the shape in the constructor, but even this will do.

Reviewed By: m3rlin45

Differential Revision: D21778992

fbshipit-source-id: a8ec2faaf734746f6dc42879705245851dc99bed
2020-05-28 21:15:45 -07:00
2f49757372 Remove sumall from TH, THC, THCUNN (#39042)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39042

Differential Revision: D21765425

Pulled By: ngimel

fbshipit-source-id: a95aba48b7202a723d9f27cb24fe98fbc26ca2c0
2020-05-28 21:03:21 -07:00
ca6579bd40 Regenerate config.yml (#39215)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39215

Differential Revision: D21779893

Pulled By: malfet

fbshipit-source-id: 69c9c6167fbeba34626ff677696dd97a0865b373
2020-05-28 20:25:04 -07:00
a04fb2ab22 [Reland] add xenial + cuda 9.2 + gcc 5.4 CI test (#39036)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39036

Test Plan: Imported from OSS

Differential Revision: D21731026

Pulled By: glaringlee

fbshipit-source-id: ae678f786f95e3687ed6b3f176fe6736a436c721
2020-05-28 19:48:18 -07:00
f7a8851e9e Fix argmin/max bug (#38946)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38922

# Reproduction

-  This is correct
```py
>>> torch.zeros(1, 32767).argmax(dim=0)
tensor([0, 0, 0,  ..., 0, 0, 0])
```

- But this is not
```py
>>> torch.zeros(1, 32768).argmax(dim=0)
tensor([    0,     0,     0,  ..., 31141, 31141, 31141])
```

- Only occurs when the size of the reduced dimension is 1

```py
>>> torch.zeros(2, 327680).argmax(dim=0)
tensor([1, 1, 1,  ..., 1, 1, 1])
>>> torch.zeros(3, 327680).argmax(dim=0)
tensor([2, 2, 2,  ..., 2, 2, 2])
```

- Has something to do with the rest of the dims
```py
>>> torch.zeros(1, 327680).argmax(dim=0)
tensor([     0,      0,      0,  ..., 311296, 311296, 311296])
```
```py
>>> torch.zeros(1, 32768, 10).argmax(dim=0)
tensor([[     0,      0,      0,  ...,      0,      0,      0],
        [     0,      0,      0,  ...,      0,      0,      0],
        [     0,      0,      0,  ...,      0,      0,      0],
        ...,
        [311296, 311296, 311296,  ..., 311296, 311296, 311296],
        [311296, 311296, 311296,  ..., 311296, 311296, 311296],
        [311296, 311296, 311296,  ..., 311296, 311296, 311296]])
```

# Reason

- `resize_outputs_` is set to `false` in `reduce_op`, but the dimension is still coalesced during `TensorIterator::build()`

899a075b25/aten/src/ATen/native/TensorIterator.cpp (L703-L715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38946

Differential Revision: D21751663

Pulled By: ngimel

fbshipit-source-id: 6d55e4bb783423b4c2df09cd3e8b87147efcbfdb
2020-05-28 19:42:07 -07:00
527ee63b7d fused convbn: respect the strict argument when loading from state dict (#39205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39205

Context:
* https://github.com/pytorch/pytorch/pull/38478 modified convbn folding logic
* https://github.com/pytorch/pytorch/pull/38820 fixed the ^ to be backwards compatible and be able to load v1 state dicts

This PR is an additional backwards-compatibility fix: it allows
older state dicts to be loaded with `strict == False`.  This is important
because there are teams who use this flow to load floating-point
checkpoints into fused models with `strict == False`.
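
For illustration, a minimal self-contained (hypothetical) example of the `strict == False` loading behavior this preserves:
```py
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
sd = {"weight": torch.randn(4, 4)}  # 'bias' intentionally missing (old format)

# strict=False tolerates missing/unexpected keys instead of raising:
result = model.load_state_dict(sd, strict=False)
print(result.missing_keys)  # ['bias']
```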

Test Plan:
1. save a floating point and corresponding fused model:
https://gist.github.com/vkuzo/177eba811a7a2ac359054fe9d4e3f099
2. load both of them, it works with strict==False and the floating point
one fails with a good error message with strict==True:
https://gist.github.com/vkuzo/447c9e797f208cb98447ffb24359d73e

Imported from OSS

Differential Revision: D21774353

fbshipit-source-id: f85f0c7fa956561824c9addb9198fea7a76a91aa
2020-05-28 19:25:45 -07:00
98a755bc8f Migrate AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1 to c10::complex (#39045)
Summary:
No special changes are needed for CPU kernels; some CUDA kernels are still doing `c10::complex -> thrust::complex` casting, which will be cleaned up later. For now it is fine to keep that as is and change the dispatch macro first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39045

Differential Revision: D21741151

Pulled By: anjali411

fbshipit-source-id: 748f057f9f33338b8c9293aeaa228ad861172e71
2020-05-28 18:48:05 -07:00
41363b299a test_bottleneck_cuda works on ROCm 3.3 (#38249)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38249

Differential Revision: D21665097

Pulled By: ailzhang

fbshipit-source-id: cb2deab2fe8305db6fbe9ac4bfce4bb01cd9ff29
2020-05-28 17:48:29 -07:00
0e8c65f756 Add timeout to TestBottleneck (#39191)
Summary:
Invoke `Popen.communicate` with the `timeout` argument and kill the process in the `TimeoutExpired` handler, as sketched below.
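
A hedged sketch of the pattern (the script name is illustrative, not the actual test code):
```py
import subprocess

proc = subprocess.Popen(
    ["python", "some_script.py"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=300)
except subprocess.TimeoutExpired:
    proc.kill()                     # don't let the child linger
    out, err = proc.communicate()   # collect whatever was produced
```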
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39191

Differential Revision: D21773510

Pulled By: malfet

fbshipit-source-id: 52b94315f8aa4d6c330dd5c9a8936100e49aef2d
2020-05-28 16:08:16 -07:00
9c19a12965 fix asserts in cuda code (#39047)
Summary:
Gets rid of some in-kernel asserts where they can be replaced with static_asserts
Replaces bare in-kernel `assert` in one case with `CUDA_KERNEL_ASSERT` where necessary
Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.
Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (https://github.com/pytorch/pytorch/issues/39044); the problems there are not just asserts.
I've audited remaining cases of in-kernel asserts, and they are more like `TORCH_INTERNAL_ASSERT`, so they should not happen with invalid user data. I think it's ok to leave them as is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39047

Differential Revision: D21750392

Pulled By: ngimel

fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719
2020-05-28 15:51:38 -07:00
0b9d537056 [dper][pruning] add histogram op (#38514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38514

This diff introduces the `Histogram` caffe2 op, which computes a histogram tensor for a list of input tensors. The bin edges of the histogram are defined by the `bin_edges` arg.

Test Plan: tests

Reviewed By: chocjy

Differential Revision: D21553956

fbshipit-source-id: fc98c8db691d66d2dad57b6ad14867109913cb6f
2020-05-28 15:45:04 -07:00
928e99b9bb [vulkan] jni build support USE_VULKAN (#39188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39188

Extract the Vulkan_LIBS and Vulkan_INCLUDES setup from `cmake/Dependencies.cmake` into `cmake/VulkanDependencies.cmake` and reuse it in android/pytorch_android/CMakeLists.txt.

Add control to build with Vulkan by setting the env variable `USE_VULKAN` for `scripts/build_android.sh` and `scripts/build_pytorch_android.sh`.

We do not use the Vulkan backend in pytorch_android, but with this build option we can track the Android aar size change with `USE_VULKAN` added.

Currently it is 88 KB.

Test Plan: Imported from OSS

Differential Revision: D21770892

Pulled By: IvanKobzarev

fbshipit-source-id: a39433505fdcf43d3b524e0fe08062d5ebe0d872
2020-05-28 15:39:02 -07:00
ee3bd10445 Moves angle/abs test to test_torch (#39154)
Summary:
Moves test (per request).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39154

Differential Revision: D21769706

Pulled By: mruberry

fbshipit-source-id: a09d0d0a47fbcf8f0e798d57230f2fe6a9ebf6b9
2020-05-28 14:55:40 -07:00
7b343cc30f .circleci: Remove setup job (#39081)
Summary:
The setup job isn't really what we need anymore so let's get rid of it
and remove the single point of failure from our build pipeline.

This should also resolve issues with CircleCI where "rerun workflow from failed" would trigger an entire re-run instead of only the jobs that we actually want to re-run.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39081

Differential Revision: D21770380

Pulled By: seemethere

fbshipit-source-id: 92a239deb6f2908eb46d519c332dc34c6023da6d
2020-05-28 14:46:39 -07:00
feaf72088c short-circuit pow for complex 1 and 0 exponents (#39117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39117

Test Plan: Imported from OSS

Differential Revision: D21764704

Pulled By: nairbv

fbshipit-source-id: 7f501b429aa63d121f4841fbb0ef3378b911afcd
2020-05-28 14:28:15 -07:00
5e975cf8d6 Stops cross-device data movement in tensor iterator (#38998)
Summary:
**BC-breaking note:**

In previous versions of PyTorch zero dimensional CUDA tensors could be moved across devices implicitly. For example,

```
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
```

would work, even though the tensors are on different CUDA devices. This is a frequent source of user confusion, however, and PyTorch generally does not move data across devices without it being explicit. This functionality is removed in PyTorch 1.6.

**PR Summary:**

Today in PyTorch we allow implicit data movement of zero dimensional CUDA tensors. For example, we allow:

```
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
```

and

```
torch.tensor(2, device='cuda') + torch.tensor((3, 5))
```

In both of these cases TensorIterator would move the zero dim CUDA tensor to the device of the non-scalar tensor (cuda:1 in the first snippet, the CPU in the second snippet).

One of PyTorch's fundamental rules, however, is that it does not perform implicit data movement like this, and this change causes these cases to throw an error. New tests for this behavior are added to test_torch.py, and tests of the old behavior are removed in test_torch.py and test_autograd.py. A C++ test in tensor_iterator_test.cpp is modified to account for the new behavior. The explicit fix is sketched below.
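```py
import torch

# Hedged sketch of the explicit fix (requires two CUDA devices):
a = torch.tensor(5, device='cuda:0')
b = torch.tensor((1, 1), device='cuda:1')
result = a.to('cuda:1') + b  # move the zero-dim tensor explicitly
```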

This addresses https://github.com/pytorch/pytorch/issues/36722.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38998

Differential Revision: D21757617

Pulled By: mruberry

fbshipit-source-id: 2498f07f4938d6de691fdbd5155ad2e881ff7fdb
2020-05-28 13:53:57 -07:00
d26f7f09b5 Fixup: rename BatchedTensorKey to Batched (#38798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38798

This makes it more in-line with the other keys in the file
(DispatchKey.h).

Test Plan: Imported from OSS

Differential Revision: D21691789

Pulled By: zou3519

fbshipit-source-id: 8d8b902360c0238f67bd0e58f9d969cec4b63320
2020-05-28 13:47:09 -07:00
e029d678b6 Make collect_env more robust on Windows (#39136)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39133.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39136

Differential Revision: D21763686

Pulled By: zou3519

fbshipit-source-id: d45c3b529f569554e987dfd29579fc93d4894aaf
2020-05-28 13:25:36 -07:00
5267b17a96 Revert D21748644: [pytorch][PR] Fix index overflow in ConvTranspose3d
Test Plan: revert-hammer

Differential Revision:
D21748644

Original commit changeset: 95060423219d

fbshipit-source-id: 73c53c8a27a29bc8edd5b9b8c80f0f938b04a845
2020-05-28 13:08:35 -07:00
b98948e6dd implement dynamic bucket order in DDP (#35137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35137

The bucket order is rebuilt dynamically in the first backward pass that performs reduction, when find_unused_parameters = false.
ghstack-source-id: 104794018

Test Plan: unit test

Differential Revision: D20128537

fbshipit-source-id: fad73de965cdcb59a51c0a12b248271344584b9f
2020-05-28 12:59:52 -07:00
7e1cc2daa5 Revert D21729544: add overload name for op eq.str
Test Plan: revert-hammer

Differential Revision:
D21729544

Original commit changeset: cf86f5eb101b

fbshipit-source-id: 4d8610fca30e6aaa49fff29741ab56e0dc349cfe
2020-05-28 12:27:12 -07:00
c2133179a9 add overload name for op eq.str
Summary:
See D21681838
There are two "aten::eq" ops in the lite interpreter. Add an overload name for the op eq.str.

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D21729544

fbshipit-source-id: cf86f5eb101bb0530a3dca4051f8fe14ee184f9c
2020-05-28 11:37:24 -07:00
f58cc4b444 [RPC] Fix flaky test by waiting for async rref calls (#39012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012

The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.

What was happening is that on line 1826 the test was clearing a global variable on the remote side which was holding a rref. Even though the RPC call that unset the global variable was synchronous, the messages that the rref context needs to send around to delete that rref are asynchronous. Therefore, sometimes, when we reached line 1845 we saw the following check fail:
```
        self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.

The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.
ghstack-source-id: 104810738

Test Plan: The test isn't flaky anymore.

Differential Revision: D21716070

fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
2020-05-28 10:48:34 -07:00
377a355bcc [TensorPipe] Detect duplicate worker names (#39011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39011

There's a test for this, so let's implement it. It's very easy.
ghstack-source-id: 104810739

Test Plan: The test now passes.

Differential Revision: D21716068

fbshipit-source-id: 1080040b12913ea0dcc4982182d6b3f6d9ac763c
2020-05-28 10:48:29 -07:00
72f2ff5950 [TensorPipe] Improve serialization (#39010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39010

The initial version of the serialization for the TensorPipe RPC agent (i.e., the conversion from rpc::Message to tensorpipe::Message) worked around a limitation of TensorPipe of only allowing one payload per message by pickling each tensor separately and storing the pickles as metadata (which is a less efficient way of sending data over, as it goes through more copies). Having now lifted that limitation, we can improve the way we serialize. We now put the type and the id as their own payloads, we do a single pickling pass for all the tensors of the message (which allows us to deduplicate them), and we store the pickle as a payload. My impression is that pickling is a somewhat costly operation, so reducing the number of times we do it should be beneficial for performance. For this same reason, another change I've done here is to separate the allocation of the buffers from the deserialization. This will allow us (in the future) to perform the allocation on the I/O event loop but perform the unpickling in the worker thread, thus keeping the event loop more responsive.
ghstack-source-id: 104810740

Test Plan: RPC tests

Differential Revision: D21716067

fbshipit-source-id: c1475cc78afdcf0820a485ffd98c91abb35796c7
2020-05-28 10:48:24 -07:00
65aa2b65e5 [TensorPipe] Close and join TP context at shutdown (#38934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38934

The TensorPipe context contains all the threads and global state. It needs to be closed and joined upon shutdown (joining implicitly closes it). Destructing the context implicitly joins it, which is what was happening so far: we were waiting for the RPC agent to be destroyed for the TP context to be closed. However, I was seeing some TSAN errors that seemed to be happening during process termination, where the SHM reactor thread was trying to log something via GoogleLog while a static member of GoogleLog was being destructed. I suspect this means that the TP agent was being "leaked" (probably because the `RpcAgent::currentRpcAgent_` static field was still storing it) and was thus destroyed too late. The obvious solution seems to be to destroy it earlier, while GoogleLog is still active.

Test Plan:
I guess land this and see if the TSAN flakes keep happening?

testinprod

Differential Revision: D21703016

fbshipit-source-id: d117e619bb835192b1f3c8e2eb3cee94dbdb050f
2020-05-28 10:48:18 -07:00
54046c1024 [TensorPipe] Implement join correctly (#38933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38933

Based on what I could understand from how the RPC shutdown operates and from what the ProcessGroup agent does, the join method is supposed to act as a barrier among all workers that waits until they all have finished all their pending work, including work that may be triggered by nested calls or by callbacks.

ghstack-source-id: 104760684

Test Plan: Before this diff, the `test_user_rrefs_confirmed` test of the RPC suite was flakily deadlocking. After this, I haven't been able to repro that.

Differential Revision: D21703020

fbshipit-source-id: 3d36c6544f1ba8e17ce27ef520ecfd30552045dd
2020-05-28 10:48:13 -07:00
49e4e41fdc [TensorPipe] Always complete futures from thread pool (#38930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38930

Any time we mark a future as complete or set an error on it we call its callbacks, which could be arbitrary user functions and could thus be slow or blocking. The safest behavior is to always defer to the loop.
ghstack-source-id: 104760682

Test Plan: None... :(

Differential Revision: D21703017

fbshipit-source-id: ad2bdc6be25844628ae6f318ef98b496f3d93ffd
2020-05-28 10:48:07 -07:00
eaca6f32b0 [TensorPipe] Do not mark future messages as complete after they have timed out (#38931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38931

When requests time out they are not aborted, so they could in fact still complete successfully; but when they do so, they try to mark an errored future as complete, which causes an error. I don't see any atomic way of doing future->markCompleteIfNeeded, so we implement it ourselves on top of the existing API.
ghstack-source-id: 104760689

Test Plan: Hit this error in the RPC test suite, and it disappeared after this fix.

Differential Revision: D21703015

fbshipit-source-id: af92f7819ed907efb9b068a4ca65420739fac8cc
2020-05-28 10:48:02 -07:00
510971f86c [TensorPipe] Fix lock inversion upon response read error handling (#38929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38929

Fixes a TSAN error that was reported by the internal tests.

Test Plan: None... :(

Differential Revision: D21703022

fbshipit-source-id: 54480d32d8c19db01d9608a52b7b906a622ca8b2
2020-05-28 10:47:56 -07:00
0413e1e624 [TensorPipe] Fix timeout computation (#38928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38928

The original code was
```
steady_clock_time_point earliestTimeout = std::chrono::steady_clock::now() + kLargeTimeDuration;
if (std::chrono::steady_clock::now() >= earliestTimeout) {
  break;
}
if (!timeoutMap_.empty()) {
  earliestTimeout = timeoutMap_.begin()->first;
}
timeoutThreadCV_.wait_until(lock, earliestTimeout);
```
which meant we'd never break the loop, as breaking required `std::chrono::steady_clock::now()` to be *greater* than `std::chrono::steady_clock::now() + kLargeTimeDuration`.

The fixed code looks like:
```
steady_clock_time_point earliestTimeout = std::chrono::steady_clock::now() + kLargeTimeDuration;
if (!timeoutMap_.empty()) {
  earliestTimeout = timeoutMap_.begin()->first;
}
if (std::chrono::steady_clock::now() >= earliestTimeout) {
  break;
}
timeoutThreadCV_.wait_until(lock, earliestTimeout);
```
but by staring at it for a second it becomes clear that the code behaves very differently based on whether `timeoutMap_.empty()`, so I think that for better readability we should reflect that in the code, making that `if` the main one. This then allows us to do a timeout-less wait if there are no messages, which avoids the hacky `kLargeTimeDuration`.
ghstack-source-id: 104760685

Test Plan: eyes

Differential Revision: D21703021

fbshipit-source-id: 0c5062b714c92b956376ae2a8372223fd0d9f871
2020-05-28 10:47:50 -07:00
7866854184 [TensorPipe] Add cases for TP in RPC test helpers (#38927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38927

Since the regexes weren't matching, the RPC tests would never confirm that the remote end had correctly shut down and were thus retrying in a loop forever.
ghstack-source-id: 104760686

Test Plan: Ran the RPC test suite after re-enabling some of the TensorPipe tests

Differential Revision: D21703018

fbshipit-source-id: 3e4b8d22810e58c9d72c4317dcf5ba68d6e0b258
2020-05-28 10:47:44 -07:00
7b90ed1117 [TensorPipe] Pass names of endpoints to context/pipe for easier debugging (#38926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38926

TensorPipe allows the user to provide a meaningful name for each context and to specify what it thinks the name of the endpoint it's connecting to is, so that these names can be logged and matched against the otherwise not-very-informative ID of a pipe (given by the PID and some counters) for easier debugging.
ghstack-source-id: 104760688

Test Plan: Ran RPC tests with `TP_VERBOSE_LOGGING=1`.

Differential Revision: D21479799

fbshipit-source-id: 856d2ffac239a3f9b11318a92ba4534133865dc8
2020-05-28 10:45:48 -07:00
1d1f16079d [quant] Add save/load state_dict to quantized dynamic RNNs (#39105)
Summary:
Previously, dynamic LSTM modules weren't able to save/load a state_dict, since the PackedParameter used in RNNs isn't serializable from Python.
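
A minimal sketch of the round trip this enables (hedged: assumes the standard dynamic-quantization entry point):
```py
import torch

model = torch.nn.LSTM(2, 2)
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.LSTM}, dtype=torch.qint8)

sd = qmodel.state_dict()    # serializable from Python after this change
qmodel.load_state_dict(sd)  # and loadable again
```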
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39105

Test Plan: python test/test_quantization.py TestSerialization

Reviewed By: jerryzh168

Differential Revision: D21752256

Pulled By: supriyar

fbshipit-source-id: ef82cf21ce21a3a1304d147ed0da538c639f952d
2020-05-28 10:37:38 -07:00
78acc9dffb Check reinterpret_cast of complex bidirectional (#38882)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38882

Differential Revision: D21690131

Pulled By: anjali411

fbshipit-source-id: 5634f79e5a0248843625bb4eb69e854359e5d7ef
2020-05-28 09:09:39 -07:00
bfcb687b9c Nearest interpolation gpu implementation fix [Resolves issue #38985] (#39055)
Summary:
Fix a nearest-upsample dgrad bug, where the window computation was previously wrong;
also fix a Python test where the GPU implementation was previously not tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39055

Differential Revision: D21763242

Pulled By: albanD

fbshipit-source-id: 9b1d5365f40176450f529136110542fd36bd7f58
2020-05-28 08:07:14 -07:00
5702a28b26 Fix index overflow in ConvTranspose3d (#38970)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32866

The memory error in the issue is caused by `int` overflow in `col2vol`. This version, using mixed 32-bit and 64-bit index calculations, lifts the maximum possible indexing range without compromising the performance of `ConvTranspose3d`, versus a 20-30% regression with pure 64-bit indexing.

This requires that `input.numel() <= UINT_MAX`, and `channels * kernel.numel() <= UINT_MAX` otherwise it raises an error. Previously, the code would crash or give incorrect results unless `input.numel() * kernel.numel() <= INT_MAX`.

Note that the test is a minimised reproducer for the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38970

Differential Revision: D21748644

Pulled By: ezyang

fbshipit-source-id: 95060423219dc647595e1a24b3dcac520d3aecba
2020-05-28 07:28:15 -07:00
7e16dd299a [ROCm] enable mem leak check for rocm (#35953)
Summary:
CC iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35953

Differential Revision: D21742926

Pulled By: zou3519

fbshipit-source-id: f18534dbb88a84fe98b8d85ce8fde652916a72d5
2020-05-28 07:05:47 -07:00
0d4eefcd82 fix comments in gradcheck (#38877)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/38774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38877

Differential Revision: D21697680

Pulled By: albanD

fbshipit-source-id: f7cf6fb79f56eac2afceec7167c26e25f20a665d
2020-05-28 06:30:27 -07:00
e088902b4a Add type-hint check for default arguments in TorchScript C++ frontend (#39021)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/39020 by requiring users to type-hint default arguments to a TorchScript function when using the C++ frontend (the Python frontend inserts those automatically).

Since this is a bit of a niche use case, I opted for the simpler solution of making type-hints mandatory for default arguments, as opposed to trying to type-infer them. I left a comment in the code justifying this choice.

Test is included.

/cc t-vi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39021

Differential Revision: D21755317

Pulled By: suo

fbshipit-source-id: e007650d3bfb3a4c58c25ad2c3a17759898f303b
2020-05-28 01:42:04 -07:00
7543e7e558 Migrate minall, max, maxall from THC to ATen and cleanup THC (#39029)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36900 fixes https://github.com/pytorch/pytorch/issues/24594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39029

Differential Revision: D21747599

Pulled By: ngimel

fbshipit-source-id: 9c18876f2ceb0e36db4e043acdb813bfe7ccf6d1
2020-05-28 01:27:40 -07:00
f5bc91f851 Get rid of multiple inheritence in test_torch (#39110)
Summary:
`_TestTorchMixin` is base class which is instantiated across multiple types.
It was inherited from `object` in order to hide it from unittest test discovery mechanism.
But this approach makes it almost impossible to use static code analyzer on the class.
This PR implements alternative approach by hiding base class into inner class, per https://stackoverflow.com/a/25695512

Change the imported class access path in `test_cuda.py`. The pattern is sketched below.
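
A hedged sketch of the hiding pattern (class and test names are illustrative):
```py
import unittest

class TorchTestsBase:
    # Nested TestCase subclasses are not collected by unittest discovery,
    # so this mixin is never instantiated on its own.
    class _TestTorchMixin(unittest.TestCase):
        def test_add(self):
            self.assertEqual(1 + 1, 2)

class TestTorchCPU(TorchTestsBase._TestTorchMixin):
    pass  # this concrete subclass is discovered and run

if __name__ == "__main__":
    unittest.main()
```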
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39110

Test Plan:
run `test_torch.py --discover-tests` and `test_cuda.py --discover-tests` before and after change:
```
$ python test_torch.py --discover-tests|md5sum
2ca437bb5d65700763ce04cdacf6de3e  -
$ python test_cuda.py --discover-tests|md5sum
b17df916fb0eeb6f0dd7222d7dae392c  -
```

Differential Revision: D21759265

Pulled By: malfet

fbshipit-source-id: b01b06111469e551f7b78387449975e5248f6b9e
2020-05-27 22:45:06 -07:00
01815be1e4 Infinite timeout for operations against ProcessGroup for RPC (#38577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38577

We don't want to limit the timeout to 30 min, since there could be no
operations within that time frame. Bump it to 2^31 - 1 (int32 max).
ghstack-source-id: 104743727

Test Plan: CI

Differential Revision: D21602425

fbshipit-source-id: ab002262f01664b538761202b3bd7584fcee3c6b
2020-05-27 22:35:13 -07:00
b0420cc2de [Caffe2] Change shape_hints format (#39100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39100

The old shape_hints format has a few cons:
- ',' is used to separate <model_id>:<shape_hints> pairs, as well as being the delimiter for dims in the <shape_hints>, which is an obvious bug
- it cannot handle the case of having ':' in tensor names

The new shape_hints format uses '::' to delimit <model_id> and <shape_hints>, and ';' to delimit <model_id>::<shape_hints> pairs. Inside <shape_hints>, '|' is used to separate <tensor>,<shape> pairs, and ',' is used to delimit <tensor> and <shape>, as well as the dimensions inside <shape>. A parsing sketch follows.
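
A hedged sketch of parsing the new format described above (not the actual caffe2 implementation):
```py
def parse_shape_hints(s):
    hints = {}
    for pair in s.split(';'):                 # <model_id>::<shape_hints> pairs
        model_id, shape_hints = pair.split('::', 1)
        tensors = {}
        for entry in shape_hints.split('|'):  # <tensor>,<shape> pairs
            name, *dims = entry.split(',')    # tensor name, then its dims
            tensors[name] = [int(d) for d in dims]
        hints[model_id] = tensors
    return hints

print(parse_shape_hints("m1::x,1,2,3|y,4,5;m2::z,7"))
# {'m1': {'x': [1, 2, 3], 'y': [4, 5]}, 'm2': {'z': [7]}}
```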

Test Plan:
```
buck test //caffe2/caffe2/fb/opt:shape_info_utils_test
```

AI/AF canary:
https://www.internalfb.com/intern/ads/canary/426980448937212687
https://www.internalfb.com/intern/ads/canary/426980529105312403

Reviewed By: yinghai

Differential Revision: D21656832

fbshipit-source-id: 9dec4b5586d093ddb814c3f15041a57d45a3de76
2020-05-27 21:55:25 -07:00
05f097b5bb Implement logaddexp (#38384)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/38377
Related https://github.com/pytorch/pytorch/issues/38349

This op should be disambiguated from `logsumexp`, which does a reduction over a specific axis of a tensor; see the comparison sketch below.
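```py
import torch

a = torch.tensor([-100.0, -1.0, 0.0])
b = torch.tensor([-1.0, -2.0, 0.0])

# Elementwise log(exp(a) + exp(b)), computed stably:
print(torch.logaddexp(a, b))

# logsumexp, by contrast, reduces over a dimension:
x = torch.stack([a, b])
print(torch.logsumexp(x, dim=0))  # matches logaddexp(a, b) here
```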
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38384

Differential Revision: D21737336

Pulled By: mruberry

fbshipit-source-id: 7864d04ca304c0fb2937bb083583e3e3d6ef205d
2020-05-27 20:27:31 -07:00
90a8cdfdbf Automatic update of fbcode/onnx to eae3eb8c61cf5ad27cc9a416dbdc5274982385a6 (#39089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39089

Previous import was 79a7e0df7e86e0f32e7a05f563b24a566540c18b

Included changes:
- **[eae3eb8c](https://github.com/onnx/onnx/commit/eae3eb8c)**: Use cmake GNUInstallDirs (#2661) <Gustavo Alvarez>
- **[106821e9](https://github.com/onnx/onnx/commit/106821e9)**: Update sequence test case so input is not scalar and splits are specified (#2675) <Scott McKay>
- **[e094e101](https://github.com/onnx/onnx/commit/e094e101)**: Remove unnecessary copies and std::move (#2684) <Changming Sun>
- **[71145275](https://github.com/onnx/onnx/commit/71145275)**: Update Batchnorm test (#2674) <Lara Haidar>
- **[da13be2d](https://github.com/onnx/onnx/commit/da13be2d)**: Rename OPTIONAL to OPTIONAL_VALUE (#2682) <Changming Sun>
- **[2987fa06](https://github.com/onnx/onnx/commit/2987fa06)**: Adding CI for ONNX Debug mode (Linux, OSX) (#2651) <Vinitra Swamy>
- **[46fe392d](https://github.com/onnx/onnx/commit/46fe392d)**: Update Pow input types in Opset 12 (#2666) <Lara Haidar>
- **[ac1caf3b](https://github.com/onnx/onnx/commit/ac1caf3b)**: Change type of label tensor to int32/int64 in SoftmaxCrossEntropyLoss spec. (#2667) <M. Zeeshan Siddiqui>
- **[c2fefcbf](https://github.com/onnx/onnx/commit/c2fefcbf)**: [Training] SG with Momentum Optimizer (#1959) <Wei-Sheng Chin>
- **[8d15705e](https://github.com/onnx/onnx/commit/8d15705e)**: [Training] Add Adagrad optimizer operator (#1955) <Wei-Sheng Chin>
- **[94b01cdd](https://github.com/onnx/onnx/commit/94b01cdd)**: Suppress a warning in unsqueeze (#2637) <Hong Xu>
- **[0582d526](https://github.com/onnx/onnx/commit/0582d526)**: Fix Greater/LessOrEqual function definition (#2645) <Takeshi Watanabe>
- **[b852d819](https://github.com/onnx/onnx/commit/b852d819)**: Increment version number to 1.7.0 (#2639) <Chin Huang>
- **[ff4bb553](https://github.com/onnx/onnx/commit/ff4bb553)**: Regenerate Min test data (#2644) <Takeshi Watanabe>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D21750299

fbshipit-source-id: c33ec1b1e0dc65d0187e78db96d749f9037aae9c
2020-05-27 18:55:48 -07:00
988e31c788 Revert D21752017: [pytorch][PR] Test PyTorch using python-3.8 + GCC-9 on Bionic
Test Plan: revert-hammer

Differential Revision:
D21752017

Original commit changeset: 56c841636349

fbshipit-source-id: adf08e03ba9610050fc5440ef453789f805fdc6b
2020-05-27 17:42:22 -07:00
d92ef9268d Revert D21728402: Simplify precision-specification in tests.
Test Plan: revert-hammer

Differential Revision:
D21728402

Original commit changeset: 85f3daf63f1b

fbshipit-source-id: 4e2a36aca15cd8d842985173395b4e1cac7135d8
2020-05-27 17:34:28 -07:00
cf8001d2d0 [TensorExpr] Fix a bug in Rfactor when there are multiple reductions (#38733)
Summary:
In `LoopNest::rfactor` we assume that there is only a single reduction below the insertion point, and when replacing the reduction we recursively replace all reductions below that point. This is not a safe assumption, as a number of transformations can introduce additional ReduceOps - most directly a `splitWithTail` on the innermost reduce axis.

This PR fixes that bug, and adds some unit tests covering the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38733

Differential Revision: D21723634

Pulled By: nickgg

fbshipit-source-id: 3ed6ffcdc2c15aef7504f9b2b91e8d827e0b5d88
2020-05-27 16:49:34 -07:00
0f1f0a1f35 update circleci scripts for rocm ubuntu bionic support (#39097)
Summary:
CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39097

Differential Revision: D21753340

Pulled By: ezyang

fbshipit-source-id: 9cd84e9c47c08702d7b67d071dc88c345b9db85c
2020-05-27 16:33:09 -07:00
b12a879184 Correct Javadoc link to master (#39038)
Summary:
Correct Javadoc link to match the 1.4 version: https://github.com/pytorch/pytorch/blob/release/1.4/docs/source/index.rst
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39038

Differential Revision: D21747969

Pulled By: jlin27

fbshipit-source-id: 941b61204e9be53e15a6351eff6f4935e6a16d24
2020-05-27 16:21:30 -07:00
30dd4acbf6 Test PyTorch using python-3.8 + GCC-9 on Bionic (#39030)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39030

Differential Revision: D21752017

Pulled By: malfet

fbshipit-source-id: 56c841636349e24c9ebef8dac18c283de3664fa5
2020-05-27 15:56:37 -07:00
fa184c351f [JIT][to-backend] Fix compilation unit and name mangling of generated module (#38679)
Summary:
**Summary**
This commit gets rid of the separate compilation unit that is currently
being created for every backend-specific module generated by
`jit::backend::generateToBackendFn` and mangles the name properly to
allow multiple backend-specific modules to coexist in the same
compilation unit.

**Test Plan**
`python test/test_jit.py TestBackends`

**Fixes**
This pull request fixes part of https://github.com/pytorch/pytorch/issues/37841.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38679

Differential Revision: D21744620

Pulled By: SplitInfinity

fbshipit-source-id: ac85b8ce0d179c057991e9299fd53a4e13ba02a9
2020-05-27 15:40:51 -07:00
ff2e29144c Refactor backward compatibility tests to use override_qengines decorator (#38838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38838

Test Plan: Imported from OSS

Differential Revision: D21676032

Pulled By: durumu

fbshipit-source-id: 5cbe56e0d72d322f540bccffb60bcdbb15385ee8
2020-05-27 15:37:47 -07:00
20397285c6 Replace use of np.allclose in tests. (#34287)
Summary:
fixes https://github.com/pytorch/pytorch/issues/34096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34287

Differential Revision: D21735525

Pulled By: ailzhang

fbshipit-source-id: 611da17cfc5a3fee77d482abccf8f9854f504263
2020-05-27 15:29:35 -07:00
898d062bfd [disagg_acc] In batch broadcast (#38700)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38700

Reviewed By: yinghai

Differential Revision: D21634147

fbshipit-source-id: 7bd1912654e2433cfb580b5f7a9fb86570a55cab
2020-05-27 15:21:37 -07:00
4239416c72 Throws runtime error on attempted addcdiv integer division (#38762)
Summary:
1.6 Deprecation Note:

In 1.6 attempting to perform integer division using addcdiv will throw a RuntimeError, and in 1.7 the behavior will change so that addcdiv always performs a true division of its tensor1 and tensor2 inputs. See the warning in torch.addcdiv's documentation for more information.

PR Summary:

This PR updates the warning that appears when addcdiv performs integer division so that it throws a RuntimeError instead. This is intended to prevent silent errors when torch.addcdiv's behavior is changed to always perform true division in 1.7. The documentation is updated (slightly) to reflect this, as are the addcdiv tests in test_torch and test_type_promotion.
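
A sketch of the behavior change, assuming integral `tensor1` and `tensor2` (the case that used to take the integer-division path):

```python
import torch

inp = torch.randn(3)
t1 = torch.tensor([4, 6, 8])
t2 = torch.tensor([2, 3, 4])

# 1.5: truncating integer division of t1 by t2 (with a warning);
# 1.6: RuntimeError; 1.7 (planned): true division, as with float inputs
torch.addcdiv(inp, t1, t2)
```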
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38762

Differential Revision: D21657585

Pulled By: mruberry

fbshipit-source-id: c514b44409706f2bcfeca4473424b30cc48aafbc
2020-05-27 14:40:07 -07:00
bb12e4dca0 Add JIT fusion pass to fuse quantized add and relu. (#38897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38897

Quantized ops support add_relu. This pass finds the quantized add + relu
pattern and fuses it into add_relu.

Test Plan: buck run caffe2/test:quantization -- test_quantization.TestFusionPasses

Reviewed By: jerryzh168

Differential Revision: D21690909

fbshipit-source-id: 607cf72dde535df15eb7638841543ab2156af464
2020-05-27 14:16:57 -07:00
248758d702 Expose qnnpack's maxpool when going through aten::max_pool2d (#38896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38896

The current way of exposing qnnpack's maxpool2d only works if the max_pool2d op is
quantized::max_pool2d. This diff moves the function so as to expose it via
aten::max_pool2d when the dispatch key is QuantizedCPU.

Test Plan: Quantized tests.

Reviewed By: supriyar

Differential Revision: D21690913

fbshipit-source-id: 75fb77329b915e3a3c3aac4d76359482976ca783
2020-05-27 14:14:35 -07:00
c6e9e9359f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#39023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39023

Reviewed By: orionr

Differential Revision: D21702529

fbshipit-source-id: 6945bba95609102409850b105a8a091e33b8acc9
2020-05-27 14:07:26 -07:00
c835dedce9 Fix the issue that PyTorch doesn't construct bool tensors from non-bo… (#38392)
Summary:
…ol values correctly (https://github.com/pytorch/pytorch/issues/37398)
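
A sketch of the intended semantics: any nonzero source value should map to `True`:

```python
import torch

torch.tensor([0, 1, -2, 0.5], dtype=torch.bool)
# tensor([False,  True,  True,  True])
```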

Signed-off-by: chengjinfang <chengjf@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38392

Differential Revision: D21737009

Pulled By: mruberry

fbshipit-source-id: c77d8c940af95f5011fe008b48ea0d16c3f501d1
2020-05-27 13:59:28 -07:00
30063347e7 remove serial_exec from scatter/gather kernel (#36181)
Summary:
Since the indexed dimension in `scatter/gather` is traversed inside the kernel, the sets of memory locations written by different threads are mutually disjoint, so writes to the same location never race across threads.
See [this comment](https://github.com/pytorch/pytorch/issues/33389#issuecomment-590017938) for a graphical explanation. A more formal description:
Suppose we deal with 3D tensors and `dim=0`, hence the `scatter_add` operations are
```
self[index[i][j][k]][j][k] += src[i][j][k],
...
self[index[i'][j'][k']][j'][k'] += src[i'][j'][k'],
...
```
Clearly, a write/read to the same memory happens if and only if:
```
index[i][j][k] = index[i'][j'][k'],
j = j',
k = k'.
```
Since the reduction over `dim=0` happens inside the kernel, distinct threads `t` and `t'` partition `dim=1,2`. That is, threads `t` and `t'` receive disjoint index sets
```
T  = {(*, j, k) sent to thread t},
T' = {(*, j', k') sent to thread t'},
T intersection T' = the empty set.
```

Hence a conflict
```
index[i][j][k] = index[i'][j'][k'],
j = j',
k = k',
```
can occur only when a single thread t receives an index set T with
`(*,j,k), (*,j',k') in T`, i.e. all conflicting updates are handled serially within one thread.

Therefore it is possible to make `scatter_add` parallel and remove `serial_exec` from the `scatter_gather_base_kernel`.
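
As a concrete instance of the update rule above (2-D case, `dim=0`), a small illustrative sketch:

```python
import torch

src = torch.ones(2, 3)
index = torch.tensor([[0, 1, 0],
                      [1, 0, 1]])
out = torch.zeros(2, 3)
out.scatter_add_(0, index, src)  # out[index[i][j]][j] += src[i][j]
# each column j is owned by a single thread, so no cross-thread write races
```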

CC v0dro
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36181

Differential Revision: D21716167

Pulled By: ngimel

fbshipit-source-id: 49aee2de43779a1f0b359c22c8589c0702ee68a2
2020-05-27 13:28:00 -07:00
b636f5e324 change the int8 test to use unquantized bias (fp32)
Summary: change the test default to test the version we care about

Test Plan: ran the test

Reviewed By: amylittleyang

Differential Revision: D21725194

fbshipit-source-id: 243fcdf1dd5784768f6ceb2b46f9f1c9e64341eb
2020-05-27 12:23:39 -07:00
df4066bbb6 Simplify precision-specification in tests. (#37181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37181

Now that assertEquals considers dtypes in determining tolerance, most
tests don't need explicitly set precision.

Those that do are a few half precision tests on cuda. In this PR, those
are broken out to be handled explicitly, though we may also want to
consider further loosening the tolerance on half-precision.

Test Plan: Imported from OSS

Differential Revision: D21728402

Pulled By: nairbv

fbshipit-source-id: 85f3daf63f1bdbb5101e8dea8c125f13448ca228
2020-05-27 12:05:33 -07:00
1c74d965ed Fix attribute warning on gcc (#38988)
Summary:
When building, my log was being spammed with:
```
warning: attribute "__visibility__" does not apply here
```

This, at least on gcc 7.4, isn't covered by silencing `-Wattribute`. The warning suggests `enum`s don't need to be exported on Linux, so I just `ifdef` it out instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38988

Differential Revision: D21722032

Pulled By: ezyang

fbshipit-source-id: ed4cfebc187dceaa9e748d85f756611fd7eda4b4
2020-05-27 11:59:06 -07:00
3d2fce6bc3 Change len(DataLoader) for IterableDataset (#38925)
Summary:
Fix https://github.com/pytorch/pytorch/issues/36176

One-liner change to ensure that `len(loader) == (len(dataset) // batch_size)` for IterableDataset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38925

Differential Revision: D21731587

Pulled By: ezyang

fbshipit-source-id: 59a086165a004c0c1c8a1ee0776b1444bd26de23
2020-05-27 11:56:41 -07:00
53b55d8f38 Use ninja build as default for HIPExtensions (#38939)
Summary:
This PR adds the following changes:
1. It sets the default extension build to use ninja
2. Adds HIPCC flags to the host code compile string for ninja builds. This is needed when host code makes HIP API calls

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38939

Differential Revision: D21721905

Pulled By: ezyang

fbshipit-source-id: 75206838315a79850ecf86a78391a31ba5ee97cb
2020-05-27 11:35:19 -07:00
dfc4be205e Fix broken reference in sync bn doc (#38890)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38890

Differential Revision: D21722162

Pulled By: ezyang

fbshipit-source-id: a7d18239917b2886fe8c1c0aaf42fc8491c8e10c
2020-05-27 11:30:48 -07:00
0edf063c24 Enable Constant Folding Tests (#38751)
Summary:
Enable tests for constant folding since constant folding is enabled for opset 12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38751

Differential Revision: D21728013

Pulled By: ezyang

fbshipit-source-id: e0ed9ad62d8b781eacfdf894e8c9609fe7e778bd
2020-05-27 11:22:19 -07:00
e07ee1954d [TensorPipe Agent] Message on Agent Shutdown (#38819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38819

Logs a message when the agent is shutting down like the other RPC
Agents.
ghstack-source-id: 104673386

Test Plan: Sandcastle

Differential Revision: D21671061

fbshipit-source-id: a44f0e4976e3acc898645a2baf6f41f45a697166
2020-05-27 11:09:45 -07:00
d08a30a300 [TensorPipe Agent] Improve Response Error Message on Agent Shutdown (#38818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38818

Standardizes the error message when a response is attempted after the
agent has shut down.
ghstack-source-id: 104673115

Test Plan: Sandcastle - no functionality change, just error message

Differential Revision: D21670706

fbshipit-source-id: d26fcd7c76758c62d432d9c4e6ef2e3af7cbedff
2020-05-27 10:57:07 -07:00
1093e26d72 [ROCm] HIP version guard for occupancy API compatibility (#38551)
Summary:
CC ezyang xw285cornell

HIP from ROCm 3.5 renames `hipOccupancyMaxActiveBlocksPerMultiprocessor` to `hipModuleOccupancyMaxActiveBlocksPerMultiprocessor`.  In addition, the API parameter types now match CUDA.  Add these changes in a backwards-compatible manner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38551

Differential Revision: D21721832

Pulled By: ezyang

fbshipit-source-id: 6fc971845e363d7495d8be9550e76d0f082c3062
2020-05-27 10:09:06 -07:00
626048efd3 Fix Windows binary jobs after migrating to the new circleci image (#39057)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39057

Differential Revision: D21742971

Pulled By: albanD

fbshipit-source-id: a25ab8b01a9b7c1e2d14fe38227f85a5b8f0db83
2020-05-27 09:35:03 -07:00
7f1c9886cd [ONNX] Enable models tests (#38791)
Summary:
PR to enable model tests which are fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38791

Reviewed By: hl475

Differential Revision: D21732498

Pulled By: houseroad

fbshipit-source-id: f417f9d4124ef5a663dc666d5c2ed6ba013b26a4
2020-05-27 09:09:59 -07:00
b789c1790f Update to use the stable windows image instead of the temporary one (#39066)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39066

Differential Revision: D21742331

Pulled By: albanD

fbshipit-source-id: e4184b38eb69a289910e79808ebe9b2510dc6b06
2020-05-27 08:19:57 -07:00
d627f2b174 Support void return type in TensorIteratorDynamicCasting checks. (#38815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38815

Some CPU kernels have void return types, and the current implementation segfaults in these cases.

Test Plan: Imported from OSS

Differential Revision: D21670717

Pulled By: gchanan

fbshipit-source-id: bc17b8330195601ca231a985ee44319447ba6cf0
2020-05-27 07:41:13 -07:00
45f32ceb4e Move needs_dynamic_casting to a non-CUDA specific file. (#38813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38813

We are going to apply this check to CPU (with some changes), so just moving this in preparation.

The code is just cut-pasted here, no behavioral change.

Test Plan: Imported from OSS

Differential Revision: D21670554

Pulled By: gchanan

fbshipit-source-id: c7e07f67bb4c6524fde12237e35892e42557103e
2020-05-27 07:41:07 -07:00
bbb5e106ad Improve error checking of CUDALoops. (#38810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38810

Same change as was applied to CPU loops -- separate out checking of the inputs and outputs.

Test Plan: Imported from OSS

Differential Revision: D21670339

Pulled By: gchanan

fbshipit-source-id: 42f208538dce1a5598d14948d8d02a1c91ba152a
2020-05-27 07:41:02 -07:00
b7882f9bd6 Improve cpu/Loops.h arity asserts. (#38809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38809

This splits the asserts into separate input/output asserts and makes the numbers precise, instead of ranges.

This is an ongoing effort to improve the Loops assertion and to integrate dynamic cast checking into CPU loops.

Test Plan: Imported from OSS

Differential Revision: D21670263

Pulled By: gchanan

fbshipit-source-id: b1868db5255a69158045b759dc9171690a2dcd01
2020-05-27 07:38:58 -07:00
13120bf677 Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to require that either both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
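
A minimal sketch of the updated contract (assuming the internal `TestCase` helper; values are illustrative):

```python
import torch
from torch.testing._internal.common_utils import TestCase

class ToleranceExample(TestCase):
    def test_tolerances(self):
        actual = torch.tensor([1.0 + 1e-7])
        expected = torch.tensor([1.0])
        self.assertEqual(actual, expected)                          # neither
        self.assertEqual(actual, expected, atol=1e-5, rtol=1.3e-6)  # both
        # self.assertEqual(actual, expected, atol=1e-5)  # rejected: rtol missing
        self.assertEqual(actual, expected, msg="tensors differ")    # msg kwarg
```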
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21740237

Pulled By: mruberry

fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
2020-05-27 06:31:07 -07:00
9b95f757af move num_profiled_runs to common_utils (#38687)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38687

Differential Revision: D21634080

Pulled By: Krovatkin

fbshipit-source-id: 55513124caf3885e475ffecd9d9f3dbc4729a573
2020-05-27 01:14:01 -07:00
916084d933 [JIT] Allow @torch.jit.unused to be used on TS classes (#38522)
Summary:
**Summary**
This commit enables the use of `torch.jit.unused` on methods of TorchScript classes.
This attribute is honoured by replacing the body of any method
marked as unused in the parsed AST for the class with `raise Exception(...)`.
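
A minimal sketch of the feature (hypothetical class; assumes compilation with `torch.jit.script`):

```python
import torch

@torch.jit.script
class Accumulator(object):
    def __init__(self):
        self.total = 0

    @torch.jit.unused
    def debug_dump(self):
        # the compiler replaces this body with `raise Exception(...)`,
        # so unscriptable code here only fails if the method is called
        import json
        print(json.dumps({"total": self.total}))
```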

**Test Plan**
This commit adds a unit test `TestClassType.test_unused_method` that
tests this feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38522

Differential Revision: D21733818

Pulled By: SplitInfinity

fbshipit-source-id: 771872359dad70fac4aae83b6b5f17abb6329890
2020-05-26 23:21:54 -07:00
93d87a16eb Revert D21493165: Automatic update of fbcode/onnx to 20b3e10e6c3a9cdab90d2bb864d1c36d3e3651cd
Test Plan: revert-hammer

Differential Revision:
D21493165

Original commit changeset: 6863b289bfbf

fbshipit-source-id: 47b530c8ffceb3673a86b6cf0c064fe6af0eb72d
2020-05-26 21:35:29 -07:00
de8c888232 Fix torch.hub.hub_dir inconsistencies (#38969)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38401

* `torch.hub.load_state_dict_from_url()` now also downloads to `$TORCH_HOME/hub/checkpoints` instead of `$TORCH_HOME/checkpoints` like `torch.hub.load()` and others.
* Make `hub_dir` private, add and use `get_dir()` instead.

Also updated docs. Did not see a need for additional unit tests.
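
A small sketch of the new accessor (the printed path assumes the default `TORCH_HOME`):

```python
import torch

# hub_dir is private now; query it via the new accessor instead
print(torch.hub.get_dir())  # e.g. ~/.cache/torch/hub
# load_state_dict_from_url() downloads into <hub dir>/checkpoints
```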
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38969

Differential Revision: D21725880

Pulled By: ailzhang

fbshipit-source-id: 58cc6b32ddbda91e58c1c1433cc3916223556ea1
2020-05-26 21:06:52 -07:00
2b789e2e03 [quant] Onnx export of quantized models with new API (#38736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38736

qconv2d and qlinear APIs were changed recently so updating the scale code accordingly

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py

Imported from OSS

Differential Revision: D21647724

fbshipit-source-id: 45d4b358ffb84f1e73da8ba3f702d5043bdb16d2
2020-05-26 21:01:18 -07:00
51274b501a Automatic update of fbcode/onnx to 20b3e10e6c3a9cdab90d2bb864d1c36d3e3651cd (#38203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38203

Previous import was 79a7e0df7e86e0f32e7a05f563b24a566540c18b

Included changes:
- **[20b3e10e](https://github.com/onnx/onnx/commit/20b3e10e)**: Add 'ignore_index' input in the spec for SoftmaxCrossEntropyLoss and NLLLoss. (#2680) <M. Zeeshan Siddiqui>
- **[eae3eb8c](https://github.com/onnx/onnx/commit/eae3eb8c)**: Use cmake GNUInstallDirs (#2661) <Gustavo Alvarez>
- **[106821e9](https://github.com/onnx/onnx/commit/106821e9)**: Update sequence test case so input is not scalar and splits are specified (#2675) <Scott McKay>
- **[e094e101](https://github.com/onnx/onnx/commit/e094e101)**: Remove unnecessary copies and std::move (#2684) <Changming Sun>
- **[71145275](https://github.com/onnx/onnx/commit/71145275)**: Update Batchnorm test (#2674) <Lara Haidar>
- **[da13be2d](https://github.com/onnx/onnx/commit/da13be2d)**: Rename OPTIONAL to OPTIONAL_VALUE (#2682) <Changming Sun>
- **[2987fa06](https://github.com/onnx/onnx/commit/2987fa06)**: Adding CI for ONNX Debug mode (Linux, OSX) (#2651) <Vinitra Swamy>
- **[46fe392d](https://github.com/onnx/onnx/commit/46fe392d)**: Update Pow input types in Opset 12 (#2666) <Lara Haidar>
- **[ac1caf3b](https://github.com/onnx/onnx/commit/ac1caf3b)**: Change type of label tensor to int32/int64 in SoftmaxCrossEntropyLoss spec. (#2667) <M. Zeeshan Siddiqui>
- **[c2fefcbf](https://github.com/onnx/onnx/commit/c2fefcbf)**: [Training] SG with Momentum Optimizer (#1959) <Wei-Sheng Chin>
- **[8d15705e](https://github.com/onnx/onnx/commit/8d15705e)**: [Training] Add Adagrad optimizer operator (#1955) <Wei-Sheng Chin>
- **[94b01cdd](https://github.com/onnx/onnx/commit/94b01cdd)**: Suppress a warning in unsqueeze (#2637) <Hong Xu>
- **[0582d526](https://github.com/onnx/onnx/commit/0582d526)**: Fix Greater/LessOrEqual function definition (#2645) <Takeshi Watanabe>
- **[b852d819](https://github.com/onnx/onnx/commit/b852d819)**: Increment version number to 1.7.0 (#2639) <Chin Huang>
- **[ff4bb553](https://github.com/onnx/onnx/commit/ff4bb553)**: Regenerate Min test data (#2644) <Takeshi Watanabe>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D21493165

fbshipit-source-id: 6863b289bfbf4235e36f0e2456ce44c776aaf164
2020-05-26 20:12:36 -07:00
362928d5dc Remove unneeded const from process group agent header (#38804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38804

This is only needed in the process group agent implementation, and
removing it from the header file prevents other translation units that include
it from having this constant.
ghstack-source-id: 104666599

Test Plan: CI

Differential Revision: D21668514

fbshipit-source-id: 1c39cc98dea99518134c66dca3ca5b124a43de1b
2020-05-26 20:01:45 -07:00
a25062ab50 [TensorExpr] Fix elimination of For loops with empty bodies (#38883)
Summary:
We do try to eliminate empty For loops, but missed a case where the body Block exists but is empty. In that case we can eliminate the loop as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38883

Differential Revision: D21723680

Pulled By: nickgg

fbshipit-source-id: 49610b0524af5b9ec30ef3b4cc0c8461838259c3
2020-05-26 18:58:57 -07:00
4fcd1c3123 run te only for profiling executor (#38591)
Summary:
* Disable the mode where PE can still run the old fuser.
* Clean up
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38591

Differential Revision: D21643664

Pulled By: Krovatkin

fbshipit-source-id: 6753ed6bdc544698a1340e59a624608ff3abf7f9
2020-05-26 18:35:25 -07:00
63e545e0fe Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol
Test Plan: revert-hammer

Differential Revision:
D21717199

Original commit changeset: 9feb856f94ee

fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259
2020-05-26 18:23:59 -07:00
ba14a701dc restore proper cuda assert behavior with DNDEBUG (#38943)
Summary:
Per title. https://github.com/pytorch/pytorch/issues/32719 essentially disabled asserts in cuda kernels in release builds. Asserts in cuda kernels are typically used to prevent invalid reads/writes, so without asserts invalid reads/writes become silent errors in most cases (sometimes they would still cause "illegal memory access" errors, but because of the caching allocator this usually won't happen).
We don't need 2 macros, CUDA_ALWAYS_ASSERT and CUDA_KERNEL_ASSERT, because all current asserts in cuda kernels are important for preventing illegal memory accesses and should never be disabled.
This PR removes the macro CUDA_ALWAYS_ASSERT and instead makes CUDA_KERNEL_ASSERT (the one commonly used in kernels) an assertion in both release and debug builds.
Fixes https://github.com/pytorch/pytorch/issues/38771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38943

Differential Revision: D21723767

Pulled By: ngimel

fbshipit-source-id: d88d8aa1b047b476d5340e69311e65aff4da5074
2020-05-26 18:11:00 -07:00
eddc3f61d0 Migrate Windows build jobs to VS 2019 for CUDA >= 10.1 (#38959)
Summary:
This PR relies on https://github.com/pytorch/pytorch/pull/38957 and https://github.com/pytorch/builder/pull/445.
Tested with pytorch/pytorch#38949 and pytorch/pytorch#38956.
Will need a rebase after the dependent commits go in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38959

Differential Revision: D21732423

Pulled By: malfet

fbshipit-source-id: 50837a026a575bb3d547526e299db7bcfd7637a8
2020-05-26 16:37:52 -07:00
7e85f6f922 Removes pickle deprecation warning (#39003)
Summary:
As per [this issue](https://github.com/pytorch/pytorch/issues/38597), this is a one-line PR to remove the pickle deprecation warning.

cc stsievert driazati
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39003

Differential Revision: D21723048

Pulled By: ezyang

fbshipit-source-id: f6cc8f28b8140edd7b46d1f26f2d99819beb933e
2020-05-26 16:32:28 -07:00
44d418957e [vulkan] TensorConversions remove explicit vulkan ifs (#39019)
Summary:
As a follow up for https://github.com/pytorch/pytorch/pull/36491 and last comments on it.

Vulkan uses the Strided layout (strides are not supported at the moment, but support is planned).
empty_strided just forwards to empty_vulkan, ignoring the stride params.

This removes the explicit ifs in TensorConversions that were added before the decision to use the Strided layout and had not been cleaned up after that :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39019

Differential Revision: D21726480

Pulled By: IvanKobzarev

fbshipit-source-id: d465456df248a118bfef441c85280aa0025860cd
2020-05-26 16:27:02 -07:00
f188b52b59 Fix the issue that Bad interaction between no_grad and numpy conversi… (#38906)
Summary:
…on (https://github.com/pytorch/pytorch/issues/37000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38906

Differential Revision: D21722033

Pulled By: albanD

fbshipit-source-id: f22aec8106e4546e828aba15be606e9d9f3eeffa
2020-05-26 16:18:58 -07:00
2e6ee853ab make onnx expect tests resiliant to producer_version changes (#39002)
Summary:
Closes gh-32561, closes gh-38545. As part of the fallout from gh-36797, this PR
- replaces `producer_version: "1.6"` in onnx expect tests with `producer_version: "XXX"`
- adapts `testing/_internal/common_utils.py` with a regex that rewrites the onnx producer_version so tests still pass

The consistency of the torch version and the onnx `producer_version` is tested in gh-36797, so there is no reason to test it again in the expect tests.

xref gh-38629 which documented how to run the onnx tests and at the same time refactored the Community documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39002

Differential Revision: D21723062

Pulled By: ezyang

fbshipit-source-id: 1bd6a8ed37d5383e69d017226dc09c0645a69aff
2020-05-26 16:11:21 -07:00
c611b57bd1 Add index number to THArgCheck error message (#38978)
Summary:
- Resolves the feature request introduced in https://github.com/pytorch/pytorch/issues/38652
- Since iteration terminates once an error occurs, we can only report the current index that caused the error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38978

Differential Revision: D21722426

Pulled By: ezyang

fbshipit-source-id: edfc3f7a320584ba22d790f2b79c3726e99aae2a
2020-05-26 16:07:04 -07:00
2751dda7f6 [docs] fix formula torch.logcumsumexp (#38952)
Summary:
Reference : https://github.com/pytorch/pytorch/pull/36308#issuecomment-632282641

After fix:

![Screenshot from 2020-05-23 15-35-09](https://user-images.githubusercontent.com/19503980/82727956-4bcabb80-9d0b-11ea-85a8-81b35012abbc.png)
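
For reference, the corrected formula computes a running log-sum-exp along the given dimension: \(y_i = \log \sum_{j=0}^{i} \exp(x_j)\).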
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38952

Differential Revision: D21722196

Pulled By: ezyang

fbshipit-source-id: 62b08c14e0ce9603133841940627df40d7b1e861
2020-05-26 16:02:43 -07:00
8650376444 DOC: fix import error (#38921)
Summary:
Fixes errors when importing the module. The import is used by sphinx in documentation builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38921

Differential Revision: D21722144

Pulled By: ezyang

fbshipit-source-id: 5f31d4750325f1753de93754a009006cbc13655e
2020-05-26 15:58:34 -07:00
47869b1b12 Windows build updates (#39035)
Summary:
Small follow-up updates https://github.com/pytorch/pytorch/pull/38971
- Remove extra whitespace
- Delete unused script files
- Modify `vs_install.ps1` to use correct workspace path
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39035

Differential Revision: D21731129

Pulled By: malfet

fbshipit-source-id: e04253e82f8753423b4634d1928f2d0fcf20ebbb
2020-05-26 15:52:50 -07:00
ccab142197 Add ROCm-specific half_support_literal for JIT. (#38899)
Summary:
CC ezyang xw285cornell sunway513 lcskrishna
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38899

Differential Revision: D21721855

Pulled By: ezyang

fbshipit-source-id: 3739c462f04cee40ff979f44387ef66b971f5303
2020-05-26 14:34:08 -07:00
0d649efb81 Updates torchvision version (#38848)
Summary:
In PyTorch 1.6 integer division using torch.div will throw a runtime error. When PyTorch Master adopts this behavior some of our ONNX tests would break if we continued to import torchvision v0.5, since v0.5 uses torch.div to perform integer division. fmassa and I recently updated Torchvision to use torch.floor_divide for integer division (at least on paths covered by the PyTorch OSS CI tests), and this PR updates our torchvision test version to include those changes. This will prevent the PyTorch OSS CI from breaking when PyTorch Master adopts the 1.6 integer division behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38848

Differential Revision: D21679988

Pulled By: mruberry

fbshipit-source-id: 1333f6254c295909cf05b6f3e352e4a0c336e5af
2020-05-26 13:35:22 -07:00
12c219de54 Fix histc with empty tensor error (#38987)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38979

The error in mentioned https://github.com/pytorch/pytorch/issues/38979 is a [`cudaErrorInvalidConfiguration` error](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038):
> This indicates that a kernel launch is requesting resources that can never be satisfied by the current device. Requesting more shared memory per block than the device supports will trigger this error, as will requesting too many threads or blocks. See cudaDeviceProp for more device limitations.

This is because we are trying to launch a kernel with block size 0.
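
A sketch of the failing case, assuming a CUDA device is available:

```python
import torch

# an empty input used to launch a kernel with block size 0 and fail with
# cudaErrorInvalidConfiguration; after the fix the call should succeed
torch.histc(torch.empty(0, device="cuda"), bins=4)
```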
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38987

Differential Revision: D21722993

Pulled By: ezyang

fbshipit-source-id: 2c283e0a9f542b4acb96e895a43b991ccac808fe
2020-05-26 13:19:13 -07:00
c40a79a027 [c2] cuda impl for WeightScale op (#38712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38712

as title

Test Plan: buck test;

Reviewed By: ustctf

Differential Revision: D21586705

fbshipit-source-id: 12cd34f04f074ee12b77304055f3ba6068cf38fb
2020-05-26 12:50:54 -07:00
224ce03ebe Revert D21681838: add eq.str op to lite interpreter
Test Plan: revert-hammer

Differential Revision:
D21681838

Original commit changeset: 1f17ecdadb9b

fbshipit-source-id: bac620957d1a68057cd53f91f6a837d2b64f0e5e
2020-05-26 12:18:08 -07:00
e4a3c584d5 Fix max_pool2d nchw backward bug (#38953)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38764

The current problem is that the `top_diff` and `top_mask` pointers are shifted cumulatively across the for-n and for-c loops. This may cause overflow and illegal memory access when the loop counts are greater than one, that is, n > 65535 or c > 65535 (the case in https://github.com/pytorch/pytorch/issues/38764). Since neither n > 65535 nor c > 65535 is common, this has not been seen before. The simple fix is to use new pointer variables for the n & c offsets instead of directly modifying `top_diff` or `top_mask`.

However, I think the current nchw max_pool2d GPU impl still has plenty of room for performance improvement. We can check that in a later PR if needed.

Slightly clean up the indentation. Also add tests to use CPU impl as a reference check.

cc skrah
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38953

Differential Revision: D21721930

Pulled By: ezyang

fbshipit-source-id: fef7d911d814f8ed9fd67c60cabe5d52f8fd3d57
2020-05-26 12:00:31 -07:00
0ff1aa9058 Port TH cum{sum,prod}_cuda to ATen (#36458)
Summary:
References: https://github.com/pytorch/pytorch/issues/24521 #24522 https://github.com/pytorch/pytorch/issues/24547 #24548 https://github.com/pytorch/pytorch/issues/24507

Depends on https://github.com/pytorch/pytorch/issues/36308

Changes related to this PR are only in these files:
aten/src/ATen/Declarations.cwrap
aten/src/ATen/native/cuda/ReduceOpsKernel.cu
aten/src/ATen/native/native_functions.yaml
aten/src/THC/generic/THCTensorMathScan.cu
aten/src/THC/generic/THCTensorMathScan.h

Please Review VitalyFedyunin

Thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36458

Differential Revision: D21718384

Pulled By: ngimel

fbshipit-source-id: 5af15164050c77be164397abd659a48c9ded2b29
2020-05-26 11:50:16 -07:00
996b6a3d00 [vulkan] Fix python overrides tests for is_vulkan_available (#39016)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39016

Differential Revision: D21724619

Pulled By: IvanKobzarev

fbshipit-source-id: d7a6c8b944a55bc4f2cce957eeac08c5801667a0
2020-05-26 11:42:55 -07:00
583ff947e1 Fix max_pool2d for returning wrong shape with return_indices=True on cuda (#38992)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38986

The current code only resizes the pooling output but forgets to resize the indices as well.
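
A minimal sketch of the fixed invariant (assuming a CUDA device is available):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4, device="cuda")
out, idx = F.max_pool2d(x, kernel_size=2, return_indices=True)
assert out.shape == idx.shape  # indices are now resized along with the output
```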
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38992

Differential Revision: D21718324

Pulled By: ngimel

fbshipit-source-id: 7cf937966d38ab2167be79979475c4e0cacbf82c
2020-05-26 11:27:36 -07:00
c82375306c [vulkan] Fix Bazel build, add aten/native/vulkan/stub/*.cpp (#39018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39018

Differential Revision: D21724853

Pulled By: IvanKobzarev

fbshipit-source-id: 8d5bbc914b168da7d27c5447d625a9cfce61127f
2020-05-26 11:22:57 -07:00
fc4dfbf700 Remove reference of CUDA < 9.2 (#38977)
Summary:
Since CUDA < 9.2 is no longer supported (See https://github.com/pytorch/pytorch/pull/36848, https://github.com/pytorch/pytorch/pull/36846), this PR updates the required CUDA version in README.md to avoid confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38977

Differential Revision: D21722965

Pulled By: ezyang

fbshipit-source-id: 626772f4303d023918dda34a620d95693174d97f
2020-05-26 09:23:26 -07:00
108321dc41 move int8 fc operators and dependencies (#38935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38935

move int8 ops

Test Plan: sandcastle

Reviewed By: jackm321, zrphercule

Differential Revision: D21704235

fbshipit-source-id: 7d5b570a5840ff21ffa6256604f892a084a30b31
2020-05-26 09:17:54 -07:00
4d1df74c7c Use a temporary file during ReducerTest (#39004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37558

Use a temporary file instead of `/dev/null` in `ReducerTest`, to prevent the chance of unintended deletion when running as root. Fixing it at the test level seemed to have no strong side effects (running time?), compared to other solutions that involved modifying the behaviour of `FileStore` (for example, adding an optional flag to avoid auto-deleting the file upon destruction).

Please note this is my first contribution - I have done my best to read the contributing guide and checked for duplicate PRs with no luck, but apologies in advance for any oversights and lack of familiarity with the procedures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39004

Differential Revision: D21721966

Pulled By: mrshenli

fbshipit-source-id: 76fb81600fa08a91c35d0eb9a5aab179f5371422
2020-05-26 09:05:34 -07:00
1fef2075a5 Disable some unsupported module for 32-bit build (#38950)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38322#issuecomment-632976523 and https://github.com/pytorch/pytorch/issues/38322#issuecomment-628698852.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38950

Differential Revision: D21721918

Pulled By: ezyang

fbshipit-source-id: 999788bb88d3e3c2c06f8dec4f0d6b3389095936
2020-05-26 08:30:35 -07:00
81daadf651 Expose VC_YEAR in Windows binary test jobs (#38957)
Summary:
To make it configurable  in https://github.com/pytorch/builder/pull/445.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38957

Differential Revision: D21721933

Pulled By: ezyang

fbshipit-source-id: 510b19e59bed4ff9d6c39173b4d5c5fc69290ed0
2020-05-26 08:30:29 -07:00
6ddca30b2d Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to require that either both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21717199

Pulled By: mruberry

fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
2020-05-26 08:30:23 -07:00
341fd63ff6 add eq.str op to lite interpreter (#38859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38859

This error message indicates aten::eq expects different types

```
RUNNING 379 OP 76, aten::eq
terminate called after throwing an instance of 'c10::Error'
  what():  isInt() INTERNAL ASSERT FAILED at "buck-out/gen/68e83026/xplat/caffe2/aten_header#header-mode-symlink-tree-with-header-map,headers/ATen/core/ivalue.h":331, please report a bug to PyTorch.
```

It turns out that there are two aten::eq in lite interpreter (https://www.internalfb.com/intern/diffusion/FBS/browse/master/xplat/caffe2/torch/csrc/jit/runtime/register_prim_ops.cpp?lines=417)
aten::eq(int, int)
aten::eq(str, str)

This diff adds an overload name for str, which fixes the problem.

Test Plan: local test

Reviewed By: pengtxiafb

Differential Revision: D21681838

fbshipit-source-id: 1f17ecdadb9bc1c16915a24c60fa57a6fc273865
2020-05-26 08:30:18 -07:00
b460465a18 [Mobile GPU][Integration] Vulkan backend integration (#36491)
Summary:
This PR contains the initial version of Vulkan (GPU) Backend integration.
The primary target environment is Android, but the desktop build is also supported.

## CMake
Introducing three cmake options:
USE_VULKAN:
The main switch; if it is off, the other options have no effect.
USE_VULKAN_WRAPPER:
ON - Vulkan will be used loading it at runtime as "libvulkan.so" using libdl, every function call is wrapped in vulkan_wrapper.h.
OFF - linking with libvulkan.so directly
USE_VULKAN_SHADERC_RUNTIME:
ON - Shader compilation library will be linked, and shaders will be compiled runtime.
OFF - Shaders will be precompiled and shader compilation library is not included.

## Codegen
if `USE_VULKAN_SHADERC_RUNTIME` is ON:
Shader precompilation starts in cmake/VulkanCodegen.cmake, which calls `aten/src/ATen/native/vulkan/gen_glsl.py` or `aten/src/ATen/native/vulkan/gen_spv.py` to include the shader source or SPIR-V bytecode inside the binary as a uint32_t array in spv.h, spv.cpp.
if `USE_VULKAN_SHADERC_RUNTIME` is OFF:
The source of shaders is included as `glsl.h`,`glsl.cpp`.

All codegen results happen in the build directory.

## Build dependencies
cmake/Dependencies.cmake
If the target platform is Android - vulkan library, headers, Vulkan wrapper will be used from ANDROID_NDK.
Desktop build requires the VULKAN_SDK environment variable, and all vulkan dependencies will be used from it.
(Desktop build was tested only on Linux).

## Pytorch integration:
Adding 'Vulkan" as new Backend, DispatchKey, DeviceType.
We are using Strided layout without supporting strides at the moment, but we plan to support them in the future.
Using OpaqueTensorImpl where OpaqueHandle is copyable VulkanTensor,
more details in comments in `aten/src/ATen/native/vulkan/Vulkan.h`

Main code location: `aten/src/ATen/native/vulkan`
`aten/src/ATen/native/vulkan/VulkanAten.cpp` - connection link between ATen and Vulkan api (Vulkan.h) that converts at::Tensor to VulkanTensor.

`aten/src/ATen/native/Vulkan/Vulkan.h` - Vulkan API that contains VulkanTensor representation and functions to work with it. Plan to expose it for clients to be able to write their own Vulkan Ops.

`aten/src/ATen/native/vulkan/VulkanOps.cpp` - Vulkan Operations Implementations that uses Vulkan.h API

## GLSL shaders
Located in `aten/src/ATen/native/vulkan/glsl` as *.glsl files.
All shaders use Vulkan specialized constants for workgroup sizes with ids 1, 2, 3

## Supported operations
Code point:
conv2d no-groups
conv2d depthwise
addmm
upsample nearest 2d
clamp
hardtanh

## Testing
`aten/src/ATen/test/vulkan_test.cpp` - contains tests for
copy from CPU to Vulkan and back
all supported operations
Desktop builds supported, and testing can be done on a desktop that has Vulkan supported GPU or with installed software implementation of Vulkan, like https://github.com/google/swiftshader

## Vulkan execution
The initial implementation is trivial and waits every operator's execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36491

Differential Revision: D21696709

Pulled By: IvanKobzarev

fbshipit-source-id: da3e5a770b1a1995e9465d7e81963e7de56217fa
2020-05-26 08:30:13 -07:00
1fa0bb6d9d Use workspace to persist and restore images for Windows CI build and … (#38971)
Summary:
Inspired by malfet

> By the way, once we have build_artifacts property, can someone try if its faster to use it as mean of transferring images between build and test instead of using AWS (i.e. use artifacts instead of jenkins/pytorch/win-test-helpers/upload_image.py /download_image.py pair)

Use CircleCI to store intermediate binaries and make them available to be downloaded as artifacts instead of uploading to S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38971

Differential Revision: D21717080

Pulled By: seemethere

fbshipit-source-id: e3498b058778d02ae2f38daefbc7118a1a2cbe76
2020-05-26 08:30:07 -07:00
f07a60fcd4 Updating submodules
Summary:
GitHub commits:

3491869253
56b6191c13

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: a117b779500b040f5c1e087ed4ffb1587d745663
2020-05-26 08:30:02 -07:00
c34b333230 improve accuracy of logsoftmax computation on cuda (#38945)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38839. Previously, if the magnitude of the input values was large, the `log(sum)` term was essentially ignored when computing `max + log(sum)`; now the result is computed as
`x - max - log(sum)`, which has a better chance of preserving accuracy.
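
In the stable formulation, with \(m = \max_j x_j\), each output is \(x_i - m - \log \sum_j e^{x_j - m}\); the sum over shifted values lies in \([1, n]\), so its log is small and survives the subtraction instead of being swamped by \(m\).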
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38945

Differential Revision: D21712483

Pulled By: ngimel

fbshipit-source-id: c1a3599ed981ba7a7fd130cbd7040a706b7eace0
2020-05-26 08:29:56 -07:00
389e16c33b torch.pow Add type promotion support and fix issue with __rpow__ (#37098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37098

### **Cherry-picked from another stack:**
Some code review already occurred here: https://github.com/pytorch/pytorch/pull/32582

### Summary:

Fixes: https://github.com/pytorch/pytorch/issues/32436

The issue caused incorrect handling of dtypes for scalar ** tensor.
e.g. before this change:
```
>>> 5.5 ** torch.ones(5, dtype=torch.int32)
tensor([5, 5, 5, 5, 5], dtype=torch.int32)
```
should return a float tensor.

Also fixes a number of incorrect cases:
 * tensors to negative powers were giving incorrect results (1 instead
    of 0 or error)
 * Behavior wasn't consistent between cuda/cpu
 * large_value ** 1 in some cases gave a result not equal
    to large_value because of truncation in conversion to double and back.

BC-breaking:

Previously incorrect behavior (in 1.4):
```
>>> a
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
>>> a.pow_(.5)
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
```

After this change:
`RuntimeError: result type Float can't be cast to the desired output type Int`

Test Plan: Imported from OSS

Differential Revision: D21686207

Pulled By: nairbv

fbshipit-source-id: e797e7b195d224fa46404f668bb714e312ea78ac
2020-05-26 08:29:51 -07:00
ba3893e736 Rename torch._C.Generator to torch.Generator (#38773)
Summary:
Fix https://github.com/pytorch/pytorch/issues/26528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38773

Differential Revision: D21701053

Pulled By: pbelevich

fbshipit-source-id: 57632ca9ce430ec30dc8e40739194ee2b5860f71
2020-05-26 08:29:46 -07:00
b8f2ecbfb6 Update TensorPipe submodule (#38923)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38923

Test Plan: We'll see on CircleCI

Reviewed By: beauby

Differential Revision: D21703670

fbshipit-source-id: fd477486226303130906d669b0e9a1c888cfeee0
2020-05-26 08:28:04 -07:00
5749ef75d3 Update ShipIt sync
fbshipit-source-id: 945b4dfe99016e1788d0ec5429343e8c610f4e20
2020-05-26 08:11:42 -07:00
7e6f6f522f [PATCH] Migrate min from THC to ATen and remove _min (#38440)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/36900

Since I feel this PR is already large enough, I didn't migrate max in this PR. Legacy code is not cleaned up either. All this remaining work will be done in later PRs after this one is merged.

Benchmark on an extreme case
```python
import torch
print(torch.__version__)

t = torch.randn(100000, 2, device='cuda')

warmup = torch.arange(100000000)
torch.cuda.synchronize()

%timeit t.min(dim=0); torch.cuda.synchronize()
```
Before: 4ms; After: 24.5us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38440

Differential Revision: D21560691

Pulled By: ngimel
2020-05-26 08:10:38 -07:00
d035d05080 [pytorch] expose __ldg(const Half* ptr) to Clang in host mode (#38151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38151

We need to expose this method to Clang unconditionally when building CUDA, otherwise it would error on device code calling `__ldg` with `Half*`.

Test Plan:
```
buck build -c fbcode.caffe2_use_mpi=1 -c fbcode.cuda_use_clang=true mode/opt //experimental/training_supercomputer/trainer/hpc_pt:trainer
```

Reviewed By: ngimel

Differential Revision: D21481297

fbshipit-source-id: aacfe7de2cdc8542908249081ddb58170b1e35ff
2020-05-21 22:18:32 -07:00
f3f3097a4c Use old circleci windows image for both CPU and CUDA (#38909)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38909

Differential Revision: D21701160

Pulled By: malfet

fbshipit-source-id: 7eb81b76e3e9b269ded668e873e10695e7bb1ae4
2020-05-21 21:53:15 -07:00
cd5d7a34b8 [JIT] Factor out aliases to separate test (#38746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38746

Factors out testing of op alias normalization so that there is a registry used for tests.

Test Plan: Imported from OSS

Differential Revision: D21673107

Pulled By: eellison

fbshipit-source-id: e06653cdf24f14a4253dd054e4d402d171d16a11
2020-05-21 21:47:24 -07:00
f90dc741eb [JIT] Normalize op aliases (#38735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38735

Follow up to my comment https://github.com/pytorch/pytorch/pull/36597/#issuecomment-613674329

This adds a pass to convert op aliases into a normalized form. Having two ops in our IR that do the same thing makes the IR harder to handle for downstream consumers, such as TorchScript passes, but also ONNX, glow, etc.

Another solution would have been to fix our code generation to only emit `aten::abs` from the start. This seems trickier, and doesn't really buy us much if we still have to expose `aten::absolute` in C++, as glaringlee of the C++ API thinks we should.

Bike shedding: maybe this should be `CanonicalizeOps` instead

Test Plan: Imported from OSS

Differential Revision: D21673108

Pulled By: eellison

fbshipit-source-id: c328618907de1af22e07f57fd27fa619978c2817
2020-05-21 21:47:17 -07:00
5183e3aa16 [JIT] Rename canonicalize ops (#38734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38734

As far as I can tell, this pass only exists to canonicalize ops that are generating in the graph fuser, so it's kind of a misnomer.

Test Plan: Imported from OSS

Differential Revision: D21673109

Pulled By: eellison

fbshipit-source-id: b7bedf34ccaf1fcd442bfb2bbb990e64915f51d4
2020-05-21 21:45:15 -07:00
22454c5aeb Collect and upload error logs if VisualStudio installation fails (#38902)
Summary:
Add `store_artifacts` attribute to Windows build jobs.
In `vs_install.ps1`, add logic to download the vscollect tool and upload the collected results as build artifacts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38902

Differential Revision: D21700598

Pulled By: malfet

fbshipit-source-id: b51c47ff44ac522ad5581624f5b9a9a86cf1e595
2020-05-21 20:05:25 -07:00
4c0bf93a0e Revert D21057090: Remove useless copy on zip file load
Test Plan: revert-hammer

Differential Revision:
D21057090

Original commit changeset: e3d30a3b09f4

fbshipit-source-id: b24cbe77aae38b321882e7dcf41022710ee28ed0
2020-05-21 19:34:18 -07:00
a53422e0ee [FakeLowp] Open source more c2 ops (#38878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38878

We need to Packing op and shape extraction functions to make  some of the FakeLowP tests run in OSS.

Test Plan: unittests

Reviewed By: hyuen

Differential Revision: D21682704

fbshipit-source-id: f36321b91acfd738e90543309b82ad87b9e5c156
2020-05-21 19:10:04 -07:00
8d8b586c7a fake_quant: make qparams shape consistent (#38587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38587

Before this diff, scale+zp were initialized to tensors
with a single dimension and 1 element, and then switched
to scalar tensors after the first forward.

This diff makes the shape stay consistent.  This should fix
an issue reported when saving/loading models, which crashes
on this inconsistent shape.
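
A sketch of the invariant this fixes (assumes the default `FakeQuantize` configuration):

```python
import torch

fq = torch.quantization.FakeQuantize()
shape_before = fq.scale.shape
fq(torch.randn(8))                      # first forward computes qparams
assert fq.scale.shape == shape_before   # shape stays consistent afterwards
```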

Test Plan:
```
python test/test_quantization.py TestFakeQuantizePerTensor.test_fake_quant_preserves_qparam_shapes_for_activations
```

Imported from OSS

Differential Revision: D21605532

fbshipit-source-id: e00cd268d6d3ded1006d18d6c6759c911b3a74ea
2020-05-21 19:08:08 -07:00
455bf77da5 Remove useless copy on zip file load (#36362)
Summary:
Instead of copying to a buffer, then setting a tensor's storage with that buffer, create a storage directly from the file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36362

Pulled By: driazati

Differential Revision: D21057090

fbshipit-source-id: e3d30a3b09f4d67bf4bb7a0dd7f4f60c3dd1a47e
2020-05-21 18:57:06 -07:00
8e69c3be17 [nvFuser] Reduction support in codegen, fp16 support (#38627)
Summary:
Adds reduction support for the code generator. Reductions are fully supported with split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross-thread (intra-block) reduction support.

The two remaining pieces missing for reduction support are:
- Safety: If cross thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross block reduction: we will want inter-block reduction support to match parity with tensor iterator

This PR also provides FP16 support for fusions: we insert casts from FP16 inputs to FP32, and insert casts back to FP16 on FP16 outputs.

Also working towards reductions and shape inference for reductions in the fusion pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38627

Reviewed By: albanD

Differential Revision: D21663196

Pulled By: soumith

fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e
2020-05-21 17:18:39 -07:00
d3b0cf9ae9 Kill AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX_AND (#38462)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38462

Test Plan: Imported from OSS

Differential Revision: D21663878

Pulled By: anjali411

fbshipit-source-id: f58a173a1d7cd56986788a28a28c76dbf4386c01
2020-05-21 16:31:30 -07:00
9b9fc59b0a Add cuda version of clang9 image. (#38825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38825

Differential Revision: D21687353

Pulled By: ailzhang

fbshipit-source-id: d99bb6d034b26e851ddaf7ff2ef572ae58cc20bc
2020-05-21 16:21:43 -07:00
f9eb8824f1 Remove datatype from Storage and StorageImpl (#38870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38870

* Removed dtype data member from StorageImpl
* Removed any methods or method arguments in Storage/StorageImpl that deal with dtypes
* Update all callers of the changed API

Part of issue https://github.com/pytorch/pytorch/issues/33950
Original PR: https://github.com/pytorch/pytorch/pull/38038

Reviewed By: albanD

Differential Revision: D21549645

Pulled By: ezyang

fbshipit-source-id: 4289b356c55ff6b9530376a79343b99b540ee3de
2020-05-21 15:26:08 -07:00
9b656dac7f Switch AT_DISPATCH_COMPLEX_TYPES_AND and AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX to c10::complex (#37697)
Summary:
These two macros only appear in `Dispatch.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37697

Differential Revision: D21666340

Pulled By: anjali411

fbshipit-source-id: 1f31ab46c08b77f1011367e471874d390ffa70fb
2020-05-21 15:05:54 -07:00
0e2a0478af Support paths with spaces when building ninja extension (#38670)
Summary:
Generate the following `build.ninja` file and can successfully build:
```
cflags = -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA '-I/scratch/yuxinwu/space space/detectron2/layers/csrc' -I/private/home/yuxinwu/miniconda3/lib/python3.7
/site-packages/torch/include -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torc
h/include/TH -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/THC -I/public/apps/cuda/10.1/include -I/private/home/yuxinwu/miniconda3/include/python3.7m -c
post_cflags = -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cuda_cflags = -DWITH_CUDA '-I/scratch/yuxinwu/space space/detectron2/layers/csrc' -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include -I/private/home/yuxinwu/miniconda3/li
b/python3.7/site-packages/torch/include/torch/csrc/api/include -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/TH -I/private/home/yuxinwu/miniconda3/lib/python3.7/site
-packages/torch/include/THC -I/public/apps/cuda/10.1/include -I/private/home/yuxinwu/miniconda3/include/python3.7m -c
cuda_post_cflags = -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_
OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -ccbin=/public/apps/gcc/7.1.0/bin/gcc -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
-gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -std=c++14
ldflags =

rule compile
  command = $cxx -MMD -MF $out.d $cflags -c $in -o $out $post_cflags
  depfile = $out.d
  deps = gcc

rule cuda_compile
  command = $nvcc $cuda_cflags -c $in -o $out $cuda_post_cflags

build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/vision.o: compile /scratch/yuxinwu/space$ space/detectron2/layers/csrc/vision.c$
p
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/box_iou_rotated/box_iou_rotated_cpu.o: compile /scratch/yuxinwu/space$ space/de$
ectron2/layers/csrc/box_iou_rotated/box_iou_rotated_cpu.cpp
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/ROIAlignRotated/ROIAlignRotated_cpu.o: compile /scratch/yuxinwu/space$ space/de$
ectron2/layers/csrc/ROIAlignRotated/ROIAlignRotated_cpu.cpp
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/nms_rotated/nms_rotated_cpu.o: compile /scratch/yuxinwu/space$ space/detectron2$
layers/csrc/nms_rotated/nms_rotated_cpu.cpp
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/ROIAlign/ROIAlign_cpu.o: compile /scratch/yuxinwu/space$ space/detectron2/layer$
/csrc/ROIAlign/ROIAlign_cpu.cpp

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38670

Differential Revision: D21689613

Pulled By: ppwwyyxx

fbshipit-source-id: 1f71b12433e18f6b0c6aad5e1b390b4438654563
2020-05-21 14:57:40 -07:00
b1982c4bdb Fix multiline signatures in docstring (#38768)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38694

See https://5533621-65600975-gh.circle-artifacts.com/0/docs/torch.html

## Index Page
| Before | After |
| --- | --- |
| ![image](https://user-images.githubusercontent.com/6421097/82448124-ee1a4300-9a6e-11ea-9a48-cabf62eedd92.png)  | ![image](https://user-images.githubusercontent.com/6421097/82448175-fd00f580-9a6e-11ea-8c79-c3dd6bac0b69.png) |
| ![image](https://user-images.githubusercontent.com/6421097/82448234-0f7b2f00-9a6f-11ea-8221-19335ee60aa2.png) | ![image](https://user-images.githubusercontent.com/6421097/82448262-19049700-9a6f-11ea-9eea-ac2f71068d7f.png) |

## Detail Page
| Before | After |
| --- | --- |
| ![image](https://user-images.githubusercontent.com/6421097/82448421-4fdaad00-9a6f-11ea-9909-29692cb8ca01.png) | ![image](https://user-images.githubusercontent.com/6421097/82448440-5701bb00-9a6f-11ea-8c07-d06cb0cdfa50.png) |
| ![image](https://user-images.githubusercontent.com/6421097/82448496-68e35e00-9a6f-11ea-8db9-2d75a9328b3a.png) | ![image](https://user-images.githubusercontent.com/6421097/82448539-7567b680-9a6f-11ea-9c2e-a59eca4090c4.png) | ![image](https://user-images.githubusercontent.com/6421097/82448563-7d275b00-9a6f-11ea-97af-51f45969f473.png) |
| ![image](https://user-images.githubusercontent.com/6421097/82448329-320d4800-9a6f-11ea-8d24-3d33445cf591.png) | ![image](https://user-images.githubusercontent.com/6421097/82448353-389bbf80-9a6f-11ea-8cc8-752d3fd0dee1.png) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38768

Differential Revision: D21691859

Pulled By: zou3519

fbshipit-source-id: 336158be450436554a1fa2105a5eedf24236c56b
2020-05-21 14:39:32 -07:00
acc181c2ea Document torch.utils.cmake_prefix_path (#38727)
Summary:
Documents new global variable pointing to PyTorch CMake config files
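
For reference, a typical way to consume the variable (the CMake invocation is an assumed workflow, not part of this PR):

```python
import torch

# The variable points at the CMake config files shipped with the package,
# so CMake projects can resolve find_package(Torch) against it, e.g.:
#   cmake -DCMAKE_PREFIX_PATH="$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')" ..
print(torch.utils.cmake_prefix_path)
```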
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38727

Differential Revision: D21694243

Pulled By: malfet

fbshipit-source-id: 652532cd5da9945caf7d7dfe1fde696dc474661b
2020-05-21 14:34:19 -07:00
6d4d508d8e Log incorrect device in ProcessGroupGloo (#38844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38844

Enhances error message in ProcessGroupGloo to log the unsupported
device. Been seeing a few issues with this and this will provide more debug
information.

Test Plan: CI

Differential Revision: D21676881

fbshipit-source-id: 1fd727162682e1a55003adff67c4358dab488455
2020-05-21 13:16:50 -07:00
5dd65ba634 .circleci: Add simple backup and restore solution for RCs (#38690)
Summary:
* Does a basic upload of release candidates to an extra folder within our
S3 bucket.
* Refactors AWS promotion to allow for easier development of restoration
of backups

Backup restoration usage:
```
RESTORE_FROM=v1.6.0-rc3 restore-backup.sh
```
Requires:
  * AWS credentials to upload / download stuff
  * Anaconda credentials to upload
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38690

Differential Revision: D21691033

Pulled By: seemethere

fbshipit-source-id: 31118814db1ca701c55a3cb0bc32caa1e77a833d
2020-05-21 13:09:12 -07:00
481838f21b Sphinx parallel build (#38785)
Summary:
See https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-j

> Distribute the build over N processes in parallel, to make building on multiprocessor machines more effective. Note that not all parts and not all builders of Sphinx can be parallelized. If auto argument is given, Sphinx uses the number of CPUs as N.

- Timing results
  - Python doc build on a 40-core machine: 9:34 down to 1:29
  - pytorch_cpp_doc_push: ~1h 10m down to 47m
  - pytorch_python_doc_push: 34m down to 32m
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38785

Differential Revision: D21691991

Pulled By: zou3519

fbshipit-source-id: cfc5e8cd13414640f82edfd2ad1ce4d9c7afce12
2020-05-21 13:03:55 -07:00
a40049fd2a Better handling for msvc env when compiling cpp extensions (#38862)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38861#issuecomment-631934636.
1. Error out if msvc env is activated but `DISTUTILS_USE_SDK` is not set.
2. Attempt to activate msvc env before running ninja build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38862

Differential Revision: D21686343

Pulled By: ezyang

fbshipit-source-id: 38b366654e2d0376dbdd21276689772b78e9718e
2020-05-21 12:52:22 -07:00
4e46c95826 Fix cpp extension build failure if path contains space (#38860)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38860

Differential Revision: D21686335

Pulled By: ezyang

fbshipit-source-id: 2675f4f70b48ae3b58ea597a2b584b446d03c704
2020-05-21 12:36:27 -07:00
b9105f42a1 Kill AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX (#38792)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38792

Test Plan: Imported from OSS

Differential Revision: D21669629

Pulled By: anjali411

fbshipit-source-id: f9dd46219ff90217d94315f7223b49cc4aeab091
2020-05-21 11:45:05 -07:00
bf9395438f Disable test_nccl for ROCm (#38801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38801

NCCL specific tests that shouldn't be run on ROCm
ghstack-source-id: 104481245

Test Plan: waitforbuildbot

Differential Revision: D21667348

fbshipit-source-id: a3e558185d9b74e1eac5fae27d97d5d026baa0a1
2020-05-21 11:15:08 -07:00
07bed4b7ef remove redundant contiguous in unfold_backward. (#38871)
Summary:
As per title. Makes for a 5-25% boost on CPU in tests from [https://github.com/pytorch/pytorch/issues/36612](https://github.com/pytorch/pytorch/pull/36612).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38871

Differential Revision: D21687007

Pulled By: albanD

fbshipit-source-id: c64b2545ad159fa7463cc32f2e2a72dde6229eff
2020-05-21 10:30:24 -07:00
3487744821 Add torch.logcumsumexp (#36308)
Summary:
Creating a new PR as I am unable to push to pandeykartikey's branch, since I don't have the permissions.

Closes https://github.com/pytorch/pytorch/issues/26411

Based on https://github.com/pytorch/pytorch/issues/32876. Thanks pandeykartikey for starting this out.

Have addressed the comments.

anjali411 agadetsky albanD
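
For context, the new op in use (a minimal sketch; the reference expression in the comment follows from the op's definition):

```python
import torch

x = torch.tensor([0.0, 1.0, 2.0])
out = torch.logcumsumexp(x, dim=0)

# Numerically stable equivalent of:
ref = torch.log(torch.cumsum(torch.exp(x), dim=0))
assert torch.allclose(out, ref)
```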
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36308

Differential Revision: D21648573

Pulled By: albanD

fbshipit-source-id: bc1a8fc4ab474a1148298117a1549b0e46f7c3ff
2020-05-21 09:12:31 -07:00
b88b7d552f Prevent custom Functions from creating non differentiable type that requires grad (#38326)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38326

Test Plan: Imported from OSS

Differential Revision: D21668740

Pulled By: albanD

fbshipit-source-id: f452f65e76003492055311523a652937b1300183
2020-05-21 08:30:14 -07:00
0f1669181a Add specific list of supported types in autograd (#38325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38325

Test Plan: Imported from OSS

Differential Revision: D21668739

Pulled By: albanD

fbshipit-source-id: 2e6ebaa36e41a084aed0a8e1e16b6e37e36a1910
2020-05-21 08:28:06 -07:00
a83f25314b Some TODO fixes. (#37829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37829

- Removes memset not needed.
- Removed separate packing function added for dynamic linear.

Test Plan:
Quantized tests.

Imported from OSS

Differential Revision: D21404841

fbshipit-source-id: b24fc36961a65a9be3d4c12768031ea70bae4394
2020-05-21 07:58:57 -07:00
2c2fe6356a Add a check for stride==0 in gradcheck (#38774)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38586

Raise a proper error and fix the failing test.
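
A minimal sketch of the kind of input now rejected (an assumed repro: expanded tensors have stride 0 along broadcast dimensions, which gradcheck's perturbation logic cannot handle safely):

```python
import torch

x = torch.randn(1, requires_grad=True).expand(3)
print(x.stride())  # (0,) - all three elements alias one storage location
# Feeding a stride-0 input like this to torch.autograd.gradcheck now
# raises a proper error instead of failing in a confusing way.
```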
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38774

Differential Revision: D21668720

Pulled By: albanD

fbshipit-source-id: 5d15e9885934661c30c3dc6dd7389b7a33456a33
2020-05-21 07:54:29 -07:00
6f0e53624d Enforce that named_tensor_meta_ is non-null only if there is a non-wildcard name (#38725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38725

Today, there are two equivalent representations: named_tensor_meta_ is
null, or named_tensor_meta_ is non-null but all of the dimension names
are wildcard.  Let's reduce the opportunity for behavior divergence by
making the second representation illegal.

This will make it easier for me to add a dispatch key for named
tensor as I can rely on setters to always go through TensorImpl to
maintain invariants on DispatchKey.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21662641

Pulled By: ezyang

fbshipit-source-id: ccc6566d23ad2ba850f653364a86cc8db0428223
2020-05-21 07:48:55 -07:00
1ea80b4234 [ROCm] Set correct tolerance values for bfloat16 div tests (#38823)
Summary:
This PR fixes the tolerance values for some of the bfloat16 div tests that were enabled on ROCm with incorrect tolerance values in the PR https://github.com/pytorch/pytorch/pull/38621

Also disabled (to unblock CI) `test_addcdiv*`, for which the error is large when the absolute values in the tensor are higher. This will have to be investigated further.

ezyang jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38823

Differential Revision: D21686290

Pulled By: ezyang

fbshipit-source-id: 85472680e1886bdc7c227ed2656e0b4fd5328e46
2020-05-21 07:29:49 -07:00
d363cf4639 Fix incorrect __torch_function__ handling in einsum (#38741)
Summary:
Closes gh-38479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38741

Differential Revision: D21662512

Pulled By: ezyang

fbshipit-source-id: 247e3b50b8f2dd842c03be8d6ebe71910b619bc6
2020-05-21 06:59:25 -07:00
664a3ab5c7 Enable py38 gcc9 build config (#38805)
Summary:
Add `py38-gcc9` build-only config
Add appropriate `-Wno-xyz` flags to ATEN kernels as well as `tensorexp/llvm_jit.cpp` and `tensorexp/llvm_codegen.cpp`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38805

Differential Revision: D21682953

Pulled By: malfet

fbshipit-source-id: 5b61d0dfe8bdec8fb13e2ae5857dc5e7c6e58e42
2020-05-21 01:38:04 -07:00
e9902358df Support fp16 output in OnnxifiOp (#38846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38846

We begin to have fp16 inputs/outputs. Adding this will help with the debugging.

Test Plan: run.

Reviewed By: jfix71

Differential Revision: D21676805

fbshipit-source-id: 47788e631164d24aef0f659b281c59822b009e18
2020-05-20 22:50:24 -07:00
65e8fe1832 Perf optimization for conv and gemm kernels. (#37626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37626

Did some rescheduling of the instructions to hide the latency of the loads.
Particularly at the start of the kernel we have latency-bound chains.
It seems to improve perf for aarch32.
Also did some instruction rescheduling for the aarch64 gemm kernel. Not clear if
this actually helps with perf, especially on out-of-order CPUs, but worth a try.

Test Plan:
qnnpack tests
q8gemm-test

Imported from OSS

Differential Revision: D21339037

fbshipit-source-id: 0469581a0e3bd3fd04f15200c2171fc8c264722b
2020-05-20 21:11:34 -07:00
0b2a861507 convbn fusion: add backwards compatibility support (#38820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38820

Missed this on https://github.com/pytorch/pytorch/pull/38478/ -
the conv BN refactor was not backwards compatible because it
changed the state dict keys.

This PR adds logic to load the state dict from the old format.

Test Plan:
create ConvBn2d module instance and save state dict:
https://gist.github.com/vkuzo/5ed4701c122f629a51988d0748a3223e
load ConvBn2d from state dict:
https://gist.github.com/vkuzo/f97cb52057b6c7792920b8ae407f646b
verify that all valid permutations of above between v1 and v2 work correctly

Imported from OSS

Differential Revision: D21671329

fbshipit-source-id: 91b9ce88f99500bf4f1868ba638f1c90a594f0da
2020-05-20 20:38:46 -07:00
4d5d9c0455 qat syncbn: add test coverage (#38738)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38738

Adds test coverage for swapping BN -> SyncBN on a fused ConvBN module.

Test Plan:
```
python test/test_quantization.py TestDistributed.test_qat_convbn_fused_syncbn_replacement
```

Imported from OSS

Differential Revision: D21648320

fbshipit-source-id: 3f7f71ec7b34d7d784dcbce9974c525b5db35942
2020-05-20 20:37:13 -07:00
a8d8fc5532 [quant][graphmode] Different rule for add/add_/mul/mul_ (#38667)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38667

Test Plan: Imported from OSS

Differential Revision: D21633555

fbshipit-source-id: 03b0298e83bf4dbda41b048c0edc7bb92cd4e1df
2020-05-20 19:43:46 -07:00
57d6e19d6f Use union to cast between incompatible function pointers (#38842)
Summary:
This fixes `can not cast between incompatible function types` error if code is compiled by gcc-9.3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38842

Differential Revision: D21676360

Pulled By: malfet

fbshipit-source-id: d8b05d8381bfc961e06981731ebca87a516c2811
2020-05-20 19:38:18 -07:00
6cf5c71b32 Updating submodules
Summary:
GitHub commits:

909926d1ee
8abb78f423
ac53e737cf
0b892bcbfb
eb04bb86c6
0c2c715235
5c5e7ad98c
450e1aaae6
60e318d48d
b4b0ff439e

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 11bfde5db57254d449c0d5fb4cea1a895432989c
2020-05-20 19:30:05 -07:00
48116ac8d0 Revert "Revert D21593870: Kill AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX_AND2" (#38814)
Summary:
The failure was caused by cross merge conflicts. A new use of `AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX_AND2` at `ATen/native/cuda/TensorTransformations.cu` was added before the reverted PR merged. See c73523a4c3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38814

Differential Revision: D21670650

Pulled By: malfet

fbshipit-source-id: 867636cdb0106cb1275617ad2e355736d5d77210
2020-05-20 19:23:26 -07:00
f80df4ca79 port scatter_add to ATen (CUDA) (#38262)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24622 ](https://github.com/pytorch/pytorch/issues/24622).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38262

Differential Revision: D21656729

Pulled By: ngimel

fbshipit-source-id: 63dcbf8eeaf59d8295bf4e5c8bb9d28ad165d4eb
2020-05-20 19:03:41 -07:00
83fa3f1c36 Add HIP to the memory profiler device list (#38795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38795

Add HIP alongside CUDA

Test Plan: rocm CI

Differential Revision: D21665627

Pulled By: ilia-cher

fbshipit-source-id: 76ddf0a45094a9003f1d0d4ac94cf5e970535fd1
2020-05-20 18:59:21 -07:00
c02e7c464a Replace import cpp_benchmark with torch.utils.cpp_benchmark (#38832)
Summary:
Otherwise, I don't understand how those could have been invoked.

Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832

Differential Revision: D21675081

Pulled By: malfet

fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
2020-05-20 18:53:09 -07:00
9c88b23fa0 [bug] Binomial distribution BTRS algorithm has small chance of returning -1 (#38456)
Summary:
I was so excited to take advantage of https://github.com/pytorch/pytorch/issues/36858 getting merged that I installed the nightly build, and I'm glad I did!

It turns out that there's a _very small_ chance that the current algorithm will return a negative value (I imagine only -1 is possible but not sure about that).

Basically the culprit is the logic [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Distributions.h#L198-L213), which returns a value that passes certain checks before checking if it's negative. I can't figure out the particular range that causes this, but could reproduce it by taking a billion samples with `count` 1 and `prob` 0.9:

```python
(
    torch.distributions.Binomial(
        total_count=torch.tensor(1.0).cuda(), probs=torch.tensor(0.9).cuda()
    ).sample(torch.Size((1000000000,))) >= 0
).all()
```

Reliably evaluates to `tensor(False, device='cuda:0')` on my machine. 100M samples usually does it but not always, so that's around the rate at which this crops up (it took me most of a whole day to run into it!). Seems to be CUDA-specific, I imagine due to some esoteric reason I cannot begin to guess.

This PR tries to solve it in the most obvious way: reject negative values _before_ testing the bounding box, not after. But a better solution is probably to figure out why this occurs at all, and stop it.
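
A minimal sketch of the reordered rejection loop (hypothetical Python; the real change lives in the C++ sampler in `Distributions.h`, and both callables here are placeholders):

```python
def btrs_sample(draw_candidate, in_bounding_box):
    """draw_candidate() proposes a sample; in_bounding_box(k) is the acceptance test."""
    while True:
        k = draw_candidate()
        if k < 0:          # reject negative proposals *before* the acceptance test...
            continue
        if in_bounding_box(k):
            return k       # ...so an accepted sample can never be -1
```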
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38456

Differential Revision: D21664886

Pulled By: jerryzh168

fbshipit-source-id: 99b0eed980e214bede484c100388a74d8c40ca55
2020-05-20 17:49:40 -07:00
267a8da1bb Fix broken windows build due per channel quantization stack land. (#38828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38828

Move newly introduced functions in qnnpack_utils.h inside the ifdef.

Test Plan: CI.

Reviewed By: malfet

Differential Revision: D21672942

fbshipit-source-id: 32e23bb45f5f3f882618e91435a1ae9f80781f97
2020-05-20 17:44:58 -07:00
a7a69ad104 Fast path for contiguous tensor (#38732)
Summary:
A local run shows it improves the time to run 2000 guards from 0.00282s to 0.00187s (~30%). This is for the case when the tensor is contiguous: we don't have to recompute whether it's contiguous from the stride of each dimension.
We can further optimize other cases if there's a repro script.
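
To illustrate the saving, a Python-level sketch (the actual guard is C++, and the slow path below ignores empty-tensor edge cases the real code handles):

```python
import torch

def guard_is_contiguous(t: torch.Tensor) -> bool:
    # Fast path: TensorImpl caches contiguity, so this is just a flag read...
    return t.is_contiguous()

def recompute_is_contiguous(t: torch.Tensor) -> bool:
    # ...whereas without the fast path we effectively redo this
    # per-dimension walk over the strides on every guard check.
    acc = 1
    for size, stride in zip(reversed(t.size()), reversed(t.stride())):
        if size != 1 and stride != acc:
            return False
        acc *= size
    return True
```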
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38732

Differential Revision: D21664191

Pulled By: ailzhang

fbshipit-source-id: 125950f20c8676afc447f1d27ce4d14bbd445918
2020-05-20 16:25:11 -07:00
5b8a79ab49 fix the device inconsistency for import convert_sync_batchnorm (#38729)
Summary:
This fixes the device inconsistency reported in https://github.com/pytorch/pytorch/issues/37930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38729

Differential Revision: D21671039

Pulled By: ngimel

fbshipit-source-id: 17fdb4eae2ddaf64560dd026fe39958536ab313f
2020-05-20 15:42:53 -07:00
6736a76cec Back out "[RPC] [Minor] RPC entry point cleanup"
Summary:
Original commit changeset: b509c47fb612

(Note: this ignores all push blocking failures!)

Reviewed By: xush6528

Differential Revision: D21669711

fbshipit-source-id: e452a513a2d22eaa3bffa333fdb3277fabc24b41
2020-05-20 15:35:24 -07:00
604533ddfa [CircleCI] Add Python3.8-gcc9 config (#38747)
Summary:
`pytorch-linux-bionic-py3.8-gcc9` is based on Ubuntu 18.04 using gcc-9 and python-3.8
`pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9` adds CUDA-10.2 to the same configuration

Also in this PR:
 - Updates valgrind to 3.15.0
 - Fixes a bug where gcc-5.5 was used in gcc-5.4 configurations
 - Does not install `typing` when installing Python-3.8 from Conda
 - Installs `llvmdev-8` so that `numba/llvmlite` package compilation succeeds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38747

Differential Revision: D21670093

Pulled By: malfet

fbshipit-source-id: 995dfc20118a6945b55a81ef665a0b80dab97535
2020-05-20 14:31:46 -07:00
5b1814e44d Added per channel separate test cases for fc and deconv tests. (#37624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37624

Test Plan:
qnnpack tests
fully-connected-test
deconvolution-test

Imported from OSS

Differential Revision: D21339038

fbshipit-source-id: c1f6e9c39de51ab4ab18cd29055b5879e3137f1a
2020-05-20 14:01:48 -07:00
0a554aeed5 Changes to enable per channel support on dynamic linear. (#37623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37623

Follows the same strategy as static linear.
Same kernel now supports both per-channel and per-tensor linear.
Fixed fully connected test.

Test Plan:
qnnpack tests
q8gemm
fully-connected-test

Imported from OSS

Differential Revision: D21339040

fbshipit-source-id: 479d847c16b42c926acb67357dc3bdd2d0bd6ca4
2020-05-20 14:01:43 -07:00
b8eae1e3b1 Enabled per channel quantized static linear/conv (#37622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37622

Enable channelwise quantized test on qlinear and qconv.
Dynamic linear to follow.

Test Plan:
pytest test/quantization/test_quantized.py
pytest test/quantization/test_quantized_module.py

Imported from OSS

Differential Revision: D21339046

fbshipit-source-id: 32377680d3a6424ca1e98d3707b82839eeb349a7
2020-05-20 14:01:37 -07:00
1c9a110b22 Added per channel kernels for depthwise conv. (#37621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37621

Due to potential perf issues with using the same depthwise conv kernels for
per-channel depthwise conv, we opt for replicating the kernels and
adding per-channel support to them.
Note that the large kernel files are largely duplicates of the original
kernels. The assembly kernels have a few more modifications than the intrinsics
ones.

Test Plan:
qnnpack tests.
q8dwconv-test
convolution-test

Differential Revision: D21339042

Pulled By: kimishpatel

fbshipit-source-id: f2c3413e1e1af0b1f89770b5e0f66f402d38aee8
2020-05-20 14:01:31 -07:00
1f16d4ce1c Changes to enable per channel requant. (#37620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37620

Now channel-wise quantization is supported for linear/conv.
Depthwise convs are still pending.
Tests are altered to generate per-channel zero points and requant
scales.
All the kernels are fixed appropriately.
Added a per_channel member to the conv_param structure.
Also replicated the conv tests to exercise per-channel conv.
This was not strictly needed, since the conv kernels were changed
such that they do per-channel anyway; when per-channel is not needed,
the zero point and scale are simply the same across channels. This was done
to minimize code duplication, as the perf impact is estimated (to be
measured, though) to be low.
However, this is likely not the case for depthwise convs. Thus they will
have separate kernels, which required us to introduce the per_channel member
in the conv_param structure, to know which kernels to apply for depthwise.
The ensuing modifications were to keep everything in
sync for both regular conv and depthwise, so that there is no caveat
when reading the code about why depthwise has a separate test for
per-channel while non-depthwise conv does not.

Test Plan:
Via tests inside qnnpack, i.e., q8gemm-test, q8conv/dwconv test,
fully-connected-test, convolution-test.

Imported from OSS

Differential Revision: D21339041

fbshipit-source-id: 1b8fbd7fbd0fe0582a43996147171567b126d948
2020-05-20 14:01:26 -07:00
622f5b68f0 Enable per channel zero point. (#37619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37619

This PR introduces changes to add a per-channel zero point.
Modifies kernels appropriately.
Includes some bug fixes in enabling the per-channel zero point.

Test Plan:
Via tests inside qnnpack, i.e., q8gemm-test, q8conv/dwconv test,
fully-connected-test, convolution-test.

Imported from OSS

Differential Revision: D21339044

fbshipit-source-id: fb69488b2b04da109c69f3dd1e8a285babf2863d
2020-05-20 14:01:20 -07:00
f1991ca8e7 Interface changes to enable per channel quant. (#37618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37618

This does not make any functional changes. It just introduces API changes and
some data structure changes to hold a vector of data for zero point and
scale.

Test Plan:
Via unittests inside qnnpack, i.e., q8gemm-test, q8conv/dwconv test,
fully-connected-test, convolution-test.
PT's quantization tests.

Imported from OSS

Differential Revision: D21339039

fbshipit-source-id: 4a20cff9795a105ddd31482d1f1fe2b1dbe18997
2020-05-20 13:59:47 -07:00
96d7defb4b Revert D21593870: Kill AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX_AND2
Test Plan: revert-hammer

Differential Revision:
D21593870

Original commit changeset: b4edaa001e76

fbshipit-source-id: 6ccbe58fa58c1b529cb953e87d3235831765c856
2020-05-20 13:25:05 -07:00
3b254acd99 support complex types for tanh_cuda and tanh_backward_cuda (#38786)
Summary:
Builds on https://github.com/pytorch/pytorch/issues/37791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38786

Differential Revision: D21666138

Pulled By: anjali411

fbshipit-source-id: cbd313b8fd21109aadd614c60259b9dc505771a5
2020-05-20 12:57:40 -07:00
51b25218c0 Remove deprecated cuDNN API from caffe2 (#38680)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38680

Differential Revision: D21656332

Pulled By: ngimel

fbshipit-source-id: 8cf8040d68d849848cc0e0ad35849a5757f7eaf8
2020-05-20 12:55:58 -07:00
90400f48fc Enforce tensorboard minimum version as 1.15 (#35952)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34028
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35952

Differential Revision: D20870095

Pulled By: sanekmelnikov

fbshipit-source-id: f7de26a538841d832df3179f49dfa2145e98fcdc
2020-05-20 12:41:02 -07:00
4b248393b7 Kill AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX_AND2 (#38459)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38459

Test Plan: Imported from OSS

Differential Revision: D21593870

Pulled By: anjali411

fbshipit-source-id: b4edaa001e767e9c93bc75e907ee157ec866568a
2020-05-20 12:25:38 -07:00
c82b873dbf Migrate CPU min max to c10::complex (#38461)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38461

Test Plan: Imported from OSS

Differential Revision: D21663871

Pulled By: anjali411

fbshipit-source-id: 649a5af9ec7b428b155ed02740ae845f65a849be
2020-05-20 12:14:08 -07:00
c039540d10 [Onnxifi] optimize the dispatcher ordering (#38766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38766

We will most likely hit int32, float16 and float inputs are Onnxifi inputs.

Test Plan: runs

Reviewed By: ipiszy

Differential Revision: D21658148

fbshipit-source-id: c51917c29e223051c5dfa1c21788c6d620539562
2020-05-20 12:04:32 -07:00
d8b9448c62 [pytorch] reorder tracer code in generated VariableTypes (#38308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38308

This PR doesn't add any new functionality. The purpose of this PR
is to validate reordering tracing code in variable kernel doesn't break
anything (which is a prerequisite of stacked change of moving tracing
logic into a new dispatch backend).

And it will be easier to bisect in case it breaks something which is not
covered by tests.

Test Plan: Imported from OSS

Differential Revision: D21570685

Pulled By: ljk53

fbshipit-source-id: 616b6434326df8381fb6f07c7b9aadac86dd02b4
2020-05-20 11:47:21 -07:00
cae45e416e add skipIfRocm to TestAutograd.test_memory_profiler (#38790)
Summary:
CC ezyang xw285cornell sunway513

Skip new test until triage of ROCm CI can be completed.

Test added by a94fb71b126001630d3d1e350347c20977f14ec0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38790

Differential Revision: D21665404

Pulled By: xw285cornell

fbshipit-source-id: c03227a91c9d06f8c0ff50f4593baa9ecb507743
2020-05-20 11:41:24 -07:00
fe66bdb498 port masked_select from TH to ATen and optimize perf on CPU (#33269)
Summary:
This PR ports `masked_select` from TH to ATen and optimize the performance on CPU with TensorIterator.

https://github.com/pytorch/pytorch/issues/33053

1. single socket run: up to **5.4x** speedup;
2. single core run: up to **1.16x** speedup.
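
For context, the op being ported:

```python
import torch

x = torch.arange(6.).reshape(2, 3)
mask = x > 2
# Gathers the elements of x where mask is True into a 1-D tensor.
print(torch.masked_select(x, mask))  # tensor([3., 4., 5.])
```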
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33269

Differential Revision: D20922288

Pulled By: ngimel

fbshipit-source-id: 38e183a4e3599bba29bbbebe36264026abe1c50e
2020-05-20 11:36:29 -07:00
f4f0dd470c Migrate CPU clamp to c10::complex (#38460)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38460

Test Plan: Imported from OSS

Differential Revision: D21663855

Pulled By: anjali411

fbshipit-source-id: 2fa5e17ec12f4eabeb58c5edb8f2459407b1b3f9
2020-05-20 10:56:42 -07:00
ca1978c9db For jobs need a merge, merge with origin/master for ghstack PRs. (#38745)
Summary:
ghstack PRs have their target branch set to `gh/xxx/1234/base`, so the merge didn't work. Change it to `master` by default.
IIRC we don't use ghstack with release branches, so this should be fine? cc: ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38745

Differential Revision: D21663796

Pulled By: ailzhang

fbshipit-source-id: 3d2c7b91b0e355dc8261d8f1e7da76af8d3bcee4
2020-05-20 10:34:31 -07:00
8666ea0cd1 Remove duplicated entries in native_functions.yaml (#38389)
Summary:
`use_c10_dispatcher: full` appears twice in some entries.

This PR removes the duplicates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38389

Differential Revision: D21549098

Pulled By: zou3519

fbshipit-source-id: 4e456f740d5b4d4519650c0854a273d87fbc5f09
2020-05-20 09:21:23 -07:00
c78691b4a6 [CPU] torch.gather for complex dtypes (#36430)
Summary:
This PR resolves https://github.com/pytorch/pytorch/issues/36340.
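
For context, the newly supported case (complex dtype on CPU):

```python
import torch

t = torch.tensor([[1 + 1j, 2 + 2j], [3 + 3j, 4 + 4j]])
index = torch.tensor([[0, 0], [1, 0]])
# Along dim 1: out[i][j] = t[i][index[i][j]]
print(torch.gather(t, 1, index))
# tensor([[1.+1.j, 1.+1.j],
#         [4.+4.j, 3.+3.j]])
```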
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36430

Differential Revision: D21662139

Pulled By: anjali411

fbshipit-source-id: 361d064c1144b368afae3059c19f77abe26080a3
2020-05-20 09:15:14 -07:00
a3bab37d96 Add BatchedTensorImpl (#38424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38424

On the way to adding initial vmap support, this is the implementation
for BatchedTensorImpl. Vmap (in future PRs) leverages Tensors backed by
BatchedTensorImpl to do its work.

For more context, here is an overview of the plan to add initial vmap support.
- [this PR] Add BatchedTensorImpl
- Add one or two batching rules
- Add vmap Python API
- Add "slow" for-loop fallbacks for out-of-place functions via
dispatcher fallback mechanism.
- Add batching rules for "view" functions
- Add "slow" for-loop fallbacks for in-place functions
- Miscellaneous handling for failure cases
- And more

Test Plan: - `./build/bin/vmap_test`

Differential Revision: D21640917

Pulled By: zou3519

fbshipit-source-id: 969490a838cf2099ed80104e7d51ee8ff069e168
2020-05-20 09:10:00 -07:00
c60daedb36 Migrate CPU eye to c10::complex (#37899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37899

Test Plan: Imported from OSS

Differential Revision: D21554155

Pulled By: anjali411

fbshipit-source-id: d8beffc0e7356effa4f434635db3d2d60263b035
2020-05-20 08:16:08 -07:00
1465970a34 Update valgrind version build from source (#38754)
Summary:
Why not use valgrind-3.15.0?
Also, build it in parallel (with -j4).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38754

Differential Revision: D21657357

Pulled By: malfet

fbshipit-source-id: 22b7761a6c9672477e32f16e56f58bdcce02a75c
2020-05-19 23:29:49 -07:00
42870ddf24 Generate Dynamic Shapes (#37693)
Summary:
Yay!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37693

Differential Revision: D21641663

Pulled By: Krovatkin

fbshipit-source-id: 64e70138b31800371887d24ceb1c5d18945b4412
2020-05-19 23:17:54 -07:00
9e910a95b0 Add torch_python and _C library to bazel build (#38707)
Summary:
Split `generated_sources` into `cpp_generated_sources` and `python_generated_sources`
Add `shm` and `_C` library definitions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38707

Test Plan: `bazel build :_C.so; pushd bazel-bin/; python -c 'import _C;print(dir(_C))'; popd`

Differential Revision: D21654868

Pulled By: malfet

fbshipit-source-id: dd5f78c38fe58e5ab4cccd3eee42706f44af7989
2020-05-19 22:52:30 -07:00
530d48e93a [quant] Support for fused ConvBn1d and ConvBnRelu1d modules (#38452) (#38749)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38749

Test Plan: python test/test_quantization.py TestFused

Differential Revision: D21654659

Pulled By: supriyar

fbshipit-source-id: 301be24083e794f4e71ff1d6d842e1aaefa640f0
2020-05-19 22:48:05 -07:00
7587188037 Skips test_float_to_int_conversion_finite on MacOS (#38753)
Summary:
See https://github.com/pytorch/pytorch/issues/38752.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38753

Differential Revision: D21656330

Pulled By: mruberry

fbshipit-source-id: f1f97228f31b8a0b0535b3168a7d209fefff2769
2020-05-19 21:56:48 -07:00
40ce90bfc1 Revert D21560096: [Tensorpipe Agent] Enabling tests with OSS CI
Test Plan: revert-hammer

Differential Revision:
D21560096

Original commit changeset: 7d61cc1c354e

fbshipit-source-id: 6adfd87e354545031203d65d04f0bad4687a93cd
2020-05-19 19:39:33 -07:00
64584573f9 Updates tests for integer division deprecation (#38621)
Summary:
Updates our tests, in preparation for integer division with torch.div and torch.addcdiv throwing a runtime error, by avoiding integer division with torch.div. This creates a brief period where integer division with torch.div is untested, but that should be OK (since it will soon throw a runtime error).

These callsites were identified using https://github.com/pytorch/pytorch/issues/36897.
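
The migration pattern the updated tests follow (a sketch; either replacement makes the intended semantics explicit):

```python
import torch

a = torch.tensor([5])
b = torch.tensor([2])

# Instead of torch.div(a, b) on integer tensors (soon a runtime error):
print(torch.true_divide(a, b))   # tensor([2.5000]) - true division
print(torch.floor_divide(a, b))  # tensor([2])      - old integer semantics
```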
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38621

Differential Revision: D21612823

Pulled By: mruberry

fbshipit-source-id: 749c03a69feae02590b4395335163d9bf047e162
2020-05-19 19:28:00 -07:00
5af4e76683 Back out "Revert D21530545: Remove call_unboxed_super_slow_temp_shim" (#38742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38742

Original commit changeset: af9013ed37d2
ghstack-source-id: 104397898

Test Plan: waitforsandcastle

Differential Revision: D21651660

fbshipit-source-id: 8bb56eb8abd43fd01d1468f104babe92a09d2ad4
2020-05-19 18:23:20 -07:00
5fb26b1022 Delete cuda9-cudnn7 build which is not defined in build.sh (#38750)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38750

Test Plan: `grep -R pytorch-linux-xenial-cuda9-cudnn7-py3 .circleci`

Differential Revision: D21654262

Pulled By: malfet

fbshipit-source-id: a20cba15e9a24e9cbca7a8111d9149a9ae725886
2020-05-19 18:19:52 -07:00
9907a3eb65 Update Argmin/Argmax ONNX Export (#38329)
Summary:
Update Argmin/Argmax ONNX export in opset 12 to export with "select_last_index", and to correctly export cases where the same value appears multiple times in the input tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38329

Reviewed By: hl475

Differential Revision: D21613799

Pulled By: houseroad

fbshipit-source-id: 4597e23561f444c4e56d30c735dae7e9a8a41c5e
2020-05-19 16:56:33 -07:00
cbd0adc7b4 Migrate CPU unary ops to c10::complex (#37898)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37898

Test Plan: Imported from OSS

Differential Revision: D21554156

Pulled By: anjali411

fbshipit-source-id: 846319dd08d0e3ed3d387cf484360508fb123c81
2020-05-19 16:54:02 -07:00
bcf8973654 Add torch.utils.cmake_prefix_path pointing to share/cmake folder (#38559)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38559

Test Plan: Make sure that `cmake path/to/CMakeLists.txt -DCMAKE_PREFIX_PATH=$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')` succeeds for CMake projects that depend on the Torch package

Differential Revision: D21644066

Pulled By: malfet

fbshipit-source-id: c8e3cb2cbd7f969fadea6a3bccc41c4edb3ec546
2020-05-19 16:49:16 -07:00
363a2d9455 Revert D21530545: Remove call_unboxed_super_slow_temp_shim
Test Plan: revert-hammer

Differential Revision:
D21530545

Original commit changeset: cdfb801e5519

fbshipit-source-id: af9013ed37d27bf8dca859902918c02eb8cceeb4
2020-05-19 16:07:36 -07:00
235f62417d Fixes for profiling JIT code (#38453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38453

Two fixes:
 - RecordFunction in the JIT interpreter should exist during the execution
   of the frame, and not just when we enter the frame
 - When creating a JIT continuation in the wait instruction, we want to
   preserve the original thread-local context; right now, when we resume
   execution in the continuation, we preserve the thread-local state of the
   thread that set the future value (i.e. executed a forked task)

Test Plan: unittest, CI

Reviewed By: ngimel

Differential Revision: D21565959

Pulled By: ilia-cher

fbshipit-source-id: 206b98e3bfb0052fc8e4031da778e372cc71afc1
2020-05-19 15:50:42 -07:00
a94fb71b12 Memory profiling (#37775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37775

Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py
develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))
```

```
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```

Reviewed By: ngimel

Differential Revision: D21384248

Pulled By: ilia-cher

fbshipit-source-id: 31359cce2aa06f6255ed1ad8c60d03cb640bfec3
2020-05-19 15:48:48 -07:00
24b48372b9 Revert D21626921: override gcc version in cuda related test
Test Plan: revert-hammer

Differential Revision:
D21626921

Original commit changeset: b645845aa831

fbshipit-source-id: 148a2ee5184b0252ff7f31131ab87671673235bf
2020-05-19 15:41:08 -07:00
b995540a01 Revert D21632878: [quant] Support for fused ConvBn1d and ConvBnRelu1d modules
Test Plan: revert-hammer

Differential Revision:
D21632878

Original commit changeset: 0d73398b95d7

fbshipit-source-id: c4dd18a4220d175237f31f741a782f2596228009
2020-05-19 15:22:16 -07:00
87b198d309 add distributed/test_nccl to ROCM_BLACKLIST (#38730)
Summary:
CC ezyang xw285cornell sunway513

Work-around for recent ROCm CI failures due to 9cfc10d52e0d0a8576b0a5a347fa6fa8da86244a (https://github.com/pytorch/pytorch/issues/37294). Replaces the full revert suggested by PR https://github.com/pytorch/pytorch/issues/38689.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38730

Differential Revision: D21648707

Pulled By: xw285cornell

fbshipit-source-id: 627b11b229c7eadca1f6e0c6192c6b5b6416e6a1
2020-05-19 14:45:50 -07:00
befc76bb65 [RPC] [Minor] RPC entry point cleanup (#34292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34292

This is to finish a cleanup request from https://github.com/pytorch/pytorch/pull/34733#discussion_r392479110.

ghstack-source-id: 104361618

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D7436759

fbshipit-source-id: b509c47fb612ec3486ff1199c005eba69480ee05
2020-05-19 14:23:11 -07:00
423a00ad39 Remove call_unboxed_super_slow_temp_shim (#38351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38351

ghstack-source-id: 104368838

Test Plan: waitforsandcastle

Differential Revision: D21530545

fbshipit-source-id: cdfb801e551993ecb339f3f8ec7c9b3039766989
2020-05-19 14:19:28 -07:00
959afe0726 Overload bitwise NOT, AND, OR, XOR operators for at::Tensor (#38691)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38691

Differential Revision: D21640554

Pulled By: ezyang

fbshipit-source-id: 407f210a74d35837abf5a68b82ad5ab8d2d3902d
2020-05-19 14:16:07 -07:00
ab169fa5ac Fix find_first_set for x86 MSVC (Updated) (#38706)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38706

Differential Revision: D21640477

Pulled By: ezyang

fbshipit-source-id: fc7ff5c35fc3f776553e93f485532cc805a2af9c
2020-05-19 14:14:56 -07:00
87aa2d25ae [Tensorpipe Agent] Enabling tests with OSS CI (#38447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38447

This PR modifies `run_tests.py` to enable running Tensorpipe Agent tests with the OSS CI.
ghstack-source-id: 104321881

Test Plan: CI

Differential Revision: D21560096

fbshipit-source-id: 7d61cc1c354e9353c4a586dd2b56690c28d51d10
2020-05-19 13:34:06 -07:00
b2991c105a [Tensorpipe Agent] Dist Optimizer Tests for Tensorpipe Agent (#38446)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38446

This PR enables the Distributed Optimizer tests for the Tensorpipe Agent - all of them are currently passing so there is no need to skip any tests.
ghstack-source-id: 104321883

Differential Revision: D21560097

fbshipit-source-id: 316971b96b632f12326872a51fd9124c9eae4720
2020-05-19 13:34:00 -07:00
b782ad3b9e [Tensorpipe Agent] Dist Autograd Tests for Tensorpipe Agent (#38445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38445

This PR enables the Distributed Autograd tests for the Tensorpipe Agent. A decorator is used to skip all tests that are currently failing due to functionality lacking in the Tensorpipe RPC Agent (primarily timeouts and error handling).
ghstack-source-id: 104321884

Differential Revision: D21560098

fbshipit-source-id: 2564bfc96d196f35ef0dfb9de59791fcd29093cf
2020-05-19 13:33:55 -07:00
7492e98c7f [Tensorpipe Agent] RPC, RRef tests for Tensorpipe Agent (#38444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38444

This enables the RPC/RRef test suites to run with the Tensorpipe RPC Agent. This creates a new fixture to ensure the backend/options used are Tensorpipe, as well as a decorator to skip tests that Tensorpipe currently cannot support due to missing functionality.

One small note: the decorator function is a class method of the test class so we can check whether `self.rpc_backend` is tensorpipe. In the class-scope, the `TEST_CONFIG.rpc_backend_name` string is set to Tensorpipe, but outside the class scope, it is PGA, possibly due to importing dist_utils which sets this config to PGA by default. The cleanest solution would be to refactor the backend selection to be more uniform (since currently every backend is set slightly differently), but that would be a longer-term fix.
ghstack-source-id: 104321885

Test Plan:
Note: A couple of these tests will fail right now due to missing features. I've skipped the ones that regularly fail, but there will be some flaky tests that still fail occasionally.

The decorator `@_skip_if_tensorpipe_agent` skips the tests that fail with the Tensorpipe Agent. Remove this decorator from above the tests once they are fixed.

Differential Revision: D21412016

fbshipit-source-id: 1e801ac5ccaf87974dd4df92d556895b01468bf3
2020-05-19 13:32:58 -07:00
55914f8e83 Add skipCUDAIfRocm to test_nn test_softmax_results. (#38724)
Summary:
CC ezyang xw285cornell sunway513

Commit 59d92e442b88eae51b84adc4e902e36e8f12a4db (https://github.com/pytorch/pytorch/issues/38557) has caused this test to regularly fail on ROCm CI gfx900 hosts. Skipping the test until root-cause analysis can complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38724

Differential Revision: D21645815

Pulled By: xw285cornell

fbshipit-source-id: 4087e9565710c271ca5c026a5ae0c5132e56f44d
2020-05-19 13:20:34 -07:00
9ad14f6b43 cover nn.Conv1d in mkldnn model conversion logic (#38528)
Summary:
The current `to_mkldnn` model conversion logic under `torch.utils.mkldnn` does not cover `nn.Conv1d`. This patch fills the gap, using logic similar to that for `nn.Conv2d`. The model conversion removes unnecessary memory-format reorders of input/output tensors and thus speeds up the model.
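
A sketch of the conversion flow this extends (assumes a build with MKL-DNN enabled):

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

model = torch.nn.Conv1d(3, 8, kernel_size=3).eval()
model_mkldnn = mkldnn_utils.to_mkldnn(model)  # now also converts Conv1d

x = torch.randn(1, 3, 32)
# One reorder into MKL-DNN layout on the way in, one on the way out.
y = model_mkldnn(x.to_mkldnn()).to_dense()
```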
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38528

Differential Revision: D21640325

Pulled By: albanD

fbshipit-source-id: c3340153b5c524e020c097eb4b9e2ffcbde8896d
2020-05-19 13:04:18 -07:00
6fd48e24f1 Add support, test for kwargs in jit._fork (#38357) (#38665)
Summary:
Closes #38357.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38665

Reviewed By: suo

Differential Revision: D21643697

Pulled By: wconstab

fbshipit-source-id: c292c037f87bc2bb69a4ca163d7107d5396c53a2
2020-05-19 13:02:46 -07:00
819da00b3d Fixes floordiv dunder registrations (#38695)
Summary:
floordiv was missing a couple of dunder registrations, which was causing `__ifloordiv__` to not be called when it should. This adds the appropriate registrations and adds a test verifying that the in-place dunders actually operate in place.
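
A minimal check of the fixed behavior (a sketch):

```python
import torch

t = torch.tensor([7.0, 9.0])
storage_before = t.data_ptr()
t //= 2  # now dispatches to Tensor.__ifloordiv__
assert t.data_ptr() == storage_before  # same storage: truly in place
print(t)  # tensor([3., 4.])
```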
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38695

Differential Revision: D21633980

Pulled By: mruberry

fbshipit-source-id: a423f5ec327cdc062fd6d9d56abd36fe44ac8198
2020-05-19 12:11:38 -07:00
1ef77f9045 [quant][graphmode] Different rule for handling aten::cat (#38570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38570

We changed the rule for quantizing `aten::cat`. Previously `aten::cat` was considered to be
an op that should always be quantized, like `aten::conv2d`, but this is not ideal. A better
way is to quantize the output of `aten::cat` depending on whether the input is quantized: if it is,
then we'll quantize the output; if not, we won't, since `aten::cat` works on both
quantized and non-quantized tensors.

Test Plan: Imported from OSS

Differential Revision: D21600160

fbshipit-source-id: efa957e0eaa608fffefcdfefa7f442fab45605eb
2020-05-19 11:23:35 -07:00
dfbf9f397f Back out "Back out "[c2] register cuda op for LpNorm (fallback)"" (#38566)
Summary:
Previously we hit a CI issue in the original submission (D21562485), so we backed out the original diff (D21588831). Resubmitting here to reproduce the CI issue and ask a caffe2 dev to take a look.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38566

Original commit changeset: 6dda4b71904d

Test Plan: buck test

Reviewed By: houseroad

Differential Revision: D21589352

fbshipit-source-id: de40ff2884019e14476e31c4c952f24d6e438f5f
2020-05-19 10:37:25 -07:00
54d4b419db fix clip_grad_norm to work with parameters on the different devices (#38615)
Summary:
Per title.
We move all the individual gradient norms to a single device before stacking (a no-op if all the gradients are already on a single device). `clip_coef` is copied to the device of each gradient, which may be suboptimal as there could be multiple copies, but no worse than when we were synchronizing for each parameter. In the simple case of all gradients on a single device, there should be no synchronization.
Also, we no longer error out if the parameter list is empty or none of the parameters have gradients, and return a total_norm of 0 instead.
Fixes https://github.com/pytorch/pytorch/issues/38605
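
A rough sketch of the new device handling (illustration only, not the exact `torch.nn.utils.clip_grad_norm_` source):

```python
import torch

def clip_grad_norm_sketch(parameters, max_norm, norm_type=2.0):
    grads = [p.grad for p in parameters if p.grad is not None]
    if len(grads) == 0:
        return torch.tensor(0.0)  # empty/grad-less input: return 0, don't raise
    device = grads[0].device
    # Move each per-parameter norm to one device before stacking; if all
    # gradients already live on that device, the .to() calls are no-ops.
    total_norm = torch.norm(
        torch.stack([torch.norm(g.detach(), norm_type).to(device) for g in grads]),
        norm_type,
    )
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for g in grads:
            # clip_coef is copied to each gradient's device as needed.
            g.detach().mul_(clip_coef.to(g.device))
    return total_norm
```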
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38615

Reviewed By: ailzhang

Differential Revision: D21634588

Pulled By: ngimel

fbshipit-source-id: ea4d08d4f3445438260052820c7ca285231a156b
2020-05-19 10:33:40 -07:00
b14734d92e Add bfloat16 to CPU cauchy_kernel, log_normal_kernel, exponential_kernel (#38427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38427

Test Plan: Imported from OSS

Differential Revision: D21640640

Pulled By: pbelevich

fbshipit-source-id: 9cff8f6b5c33b3b31753c76fc8033d329b218019
2020-05-19 10:21:36 -07:00
35beff0b9f RNG infrastructure improvements (#37984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37984

- `NumericUtils.h`
CUDA distribution kernels had two variants of the transformation lambdas (`uniform`/`normal` -> `lognormal`/`exponential`/`cauchy`/`geometric`...): one for double precision and one optimized for CUDA single precision. This was done by using `::log`/`__logf`, `::exp`/`__expf` and `::tan`/`__tanf`. I moved them to `NumericUtils.h` and called them `at::exp`, `at::log` and `at::tan`. This allowed unifying the CPU/CUDA transformation templates in `TransformationHelper.h`.

- `DistributionsHelper.h`
Made `normal_distribution`, `geometric_distribution`, `exponential_distribution`, `cauchy_distribution`, `lognormal_distribution` C10_HOST_DEVICE compatible to reuse them in CPU/CUDA distribution kernels.
Replaced explicit math with transformations from `TransformationHelper.h`

- `TransformationHelper.h`
Renamed `*_transformation` to `transformation::*`
Added clear unified host/device transformations templates `normal`, `cauchy`, `exponential`, `geometric`, `log_normal` which are used by both CPU and CUDA distribution kernels and custom PRNG distribution kernels.

- `cpu/DistributionTemplates.h`
Unified `normal_kernel`, `cauchy_kernel`, `log_normal_kernel`, `geometric_kernel`, `exponential_kernel`.

- `cuda/DistributionTemplates.h`
Extracted `UNIFORM_AND_TRANSFORM` and `NORMAL_AND_TRANSFORM` macros to reuse code between distribution kernel templates.
Unified the transformation lambdas (`uniform`/`normal` -> `lognormal`/`exponential`/`cauchy`/`geometric`...)

- `test_torch.py`
Added `scipy.stats.kstest` [Kolmogorov–Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) tests for the `uniform`/`normal`/`lognormal`/`exponential`/`cauchy` distributions and a [Chi-squared](https://en.wikipedia.org/wiki/Chi-squared_test) test for the `geometric` one, to make sure that our distributions are correct (see the sketch after this list).

- `cpu_rng_test.cpp`, `rng_test.h`
Fixed `random_()`'s `from`/`to` bounds issue for floating-point types, and fixed cast/overflow warnings

- `THTensorRandom.h`, `THVector.h`
Moved unnecessary includes to `THTensorRandom.cpp`
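
A sketch of the kind of Kolmogorov–Smirnov check referred to above (assumed details; the real tests live in `test_torch.py` and require scipy):

```python
import torch
from scipy import stats

samples = torch.empty(10000).exponential_(lambd=1.0).numpy()
statistic, p_value = stats.kstest(samples, "expon")
# With a correct sampler a low p-value should be rare; a hard threshold
# like this will still flake occasionally by design.
assert p_value > 0.01
```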

Test Plan: Imported from OSS

Differential Revision: D21477955

Pulled By: pbelevich

fbshipit-source-id: 7b793d1761a7a921c4b4a4a7d21d5d6c48f03e72
2020-05-19 10:20:39 -07:00
7d38db0f9a [quant] Support for fused ConvBn1d and ConvBnRelu1d modules (#38452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38452

Test Plan:
python test/test_quantization.py TestFused

Imported from OSS

Differential Revision: D21632878

fbshipit-source-id: 0d73398b95d72a0a23b42ef36f3ede1bfcc35eda
2020-05-19 09:53:56 -07:00
320c35681d [TensorExpr] (trivial) unique Kernel input names (#38678)
Summary:
We have a bug where Function names are not uniqued, which produces bad printed output, e.g.:
```
{
  for (int i0 = 0; i0 < 1024; i0++) {
    input[i0] = t0[0 + i0 * 1];
  }
  for (int i0_1 = 0; i0_1 < 1024; i0_1++) {
    input_1[i0_1] = t1[0 + i0_1 * 1];
  }
  for (int v = 0; v < 1024; v++) {
    aten_add[v] = (input(v)) + float(1) * (input(v));
  }
  for (int v_1 = 0; v_1 < 1024; v_1++) {
    aten_sub[v_1] = (aten_add(v_1)) - float(1) * (input(v_1));
  }
}
```

Notice the names of the vars in the `aten_add` line which make it appear as though input_1 isn't used. This is because the Buf names are uniqued by the unique_name_manager but the FunctionCall names are not.

Not fixing this right now, but working around it by reducing the number of Tensors that are created with the same name ("input") in kernel.cpp. That example now looks like:
```
{
  for (int i0 = 0; i0 < 1024; i0++) {
    input1[i0] = t0[0 + i0 * 1];
  }
  for (int i0_1 = 0; i0_1 < 1024; i0_1++) {
    input2[i0_1] = t1[0 + i0_1 * 1];
  }
  for (int v = 0; v < 1024; v++) {
    aten_add[v] = (input1(v)) + float(1) * (input2(v));
  }
  for (int v_1 = 0; v_1 < 1024; v_1++) {
    aten_sub[v_1] = (aten_add(v_1)) - float(1) * (input1(v_1));
  }
}
```

To be clear, the bug still exists but it's not blocking what I'm trying to do right now 😄
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38678

Differential Revision: D21630276

Pulled By: nickgg

fbshipit-source-id: 39dec2178cf492302bc5a61e1e688ae81513858a
2020-05-19 09:49:24 -07:00
e5ada042b1 QAT ConvBN: remove explicit folding and use BN instead (#38478)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38478

Before this PR, the QAT ConvBN module inlined the batch normalization code
in order to reproduce Conv+BN folding.

This PR updates the module to use BN directly.  This is mathematically
equivalent to previous behavior as long as we properly scale
and fake quant the conv weights, but allows us to reuse the BN code
instead of reimplementing it.

In particular, this should help with speed since we can use dedicated
BN kernels, and also with DDP since we can hook up SyncBatchNorm.
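
For reference, the per-channel weight scaling in question is the standard Conv+BN folding factor; a minimal sketch (illustrative only, not the module's actual code):
```python
import torch

# fold BatchNorm statistics (gamma, running_var, eps) into a Conv2d weight:
# each output channel is scaled before fake quantization is applied
def scale_conv_weight(weight, gamma, running_var, eps=1e-5):
    factor = gamma / torch.sqrt(running_var + eps)
    return weight * factor.reshape(-1, 1, 1, 1)
```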

Test Plan:
```
python test/test_quantization.py TestQATModule
```

Imported from OSS

Differential Revision: D21603230

fbshipit-source-id: ecf8afdd833b67c2fbd21a8fd14366079fa55e64
2020-05-19 08:58:42 -07:00
8d64986202 Fix target determination file diffing (#38661)
Summary:
It seems like all this time this was accidentally doing a 3-way merge-base, oops. (`origin master HEAD` passes three commits, since bare `origin` resolves to its own ref, so git computed the merge-base of all three instead of just `origin/master` and `HEAD`.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38661

Test Plan:
```
$ git checkout gh/mohammadmahdijavanmard/1/head
$ git merge-base origin master HEAD --all
8292742ba020fcff90f14418c18741ebf606103b
$ git merge-base origin/master HEAD --all
324dc1623e2f91892038fb1b151450a7c6529dd9
```

Differential Revision: D21640939

Pulled By: yns88

fbshipit-source-id: 0f59922e7c0fd046f48fec30e8aa25c244f6dd62
2020-05-19 08:47:41 -07:00
f3b5c22dba Update On "check-doxygen.sh must be run from docs/cpp/source director… (#38641)
Summary:
…y" & "check-doxygen.sh suppress stderr output"

Fixes https://github.com/pytorch/pytorch/issues/36974
Fixes https://github.com/pytorch/pytorch/issues/36975

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38641

Differential Revision: D21640474

Pulled By: ezyang

fbshipit-source-id: f25b373a3459a1a315c009fc75fdb37d4ab6d67c
2020-05-19 07:51:38 -07:00
f6f1384811 [JIT] Refactor attributes to support buffers and parameters as first class citizens, add support for iterating over named_buffers() (#37905)
Summary:
First part of https://github.com/pytorch/pytorch/issues/36211 - still a WIP, but asking for commentary to ensure this is the direction we want to go in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37905

Differential Revision: D21633735

Pulled By: voznesenskym

fbshipit-source-id: f4e4302e40114513776c9e48867a90d72049e2e9
2020-05-18 23:23:43 -07:00
c4d3b042e8 Cleanup BUILD.bazel (#38699)
Summary:
Use recursive glob to make `aten_headers` and `torch_headers` declaration more compact
Use a list comprehension to define torch_cpp_api tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38699

Differential Revision: D21635357

Pulled By: malfet

fbshipit-source-id: ecab437d471b6be0c3caf669d4f59fcda9409249
2020-05-18 22:02:42 -07:00
1a3f646b9c Regenerate .circleci/config.yml (#38705)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38705

Differential Revision: D21635212

Pulled By: malfet

fbshipit-source-id: e14b1f150689187049530de32aca8773a8d37264
2020-05-18 21:52:57 -07:00
ddfd720e5d Redundant schema registration Prevention for Manually Boxed Wrappers (#38588)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38588

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D21508186

Pulled By: MohammadMahdiJavanmard

fbshipit-source-id: 1fc8f29d0eb107f847d3ae90e6728c38337f2808
2020-05-18 21:41:56 -07:00
5e55f0805f override gcc version in cuda related test (#38675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38675

Test Plan: Imported from OSS

Differential Revision: D21626921

Pulled By: glaringlee

fbshipit-source-id: b645845aa831cb64078fe2309881038138abb443
2020-05-18 21:20:52 -07:00
fc19747d64 handle grad with stride=0 on GPU MvBackward (#38321)
Summary:
References : https://github.com/pytorch/pytorch/issues/38315 ,  https://github.com/pytorch/pytorch/issues/29984

cuBlas expects strides to be greater than 0.
Cloning the `grad` allocates a new vector with
non-zero strides.

For CPU, we don't clone and allocate a new vector, as the CPU implementation works with stride=0.
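A minimal illustration of the stride-0 case (variable names are illustrative):
```python
import torch

# expand() produces a view whose stride is 0 along the expanded dimension
grad = torch.ones(1).expand(4)
print(grad.stride())          # (0,)
# clone() materializes contiguous storage with non-zero strides,
# which is what cuBLAS requires
print(grad.clone().stride())  # (1,)
```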
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38321

Differential Revision: D21628966

Pulled By: ngimel

fbshipit-source-id: 390caf835af6d1d77ed537b7fcc113a22c3ec301
2020-05-18 20:53:36 -07:00
86397f6b24 [quant] Remove get_qparams in Observers (#38435)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38435

Test Plan: Imported from OSS

Differential Revision: D21597835

Pulled By: jerryzh168

fbshipit-source-id: 88a8dd110db5586509bf98fa6712290f1756c272
2020-05-18 20:49:33 -07:00
d5461e7ac8 [quant][graphmode] Move processing code to prepare_script (#38669)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38669

Test Plan: Imported from OSS

Differential Revision: D21623385

fbshipit-source-id: a59630de47f4927ae8af3801240101d307901671
2020-05-18 20:18:11 -07:00
91163addf8 organize verbatim sources with subdirectories (#38688)
Summary:
re-use master-only branch filter logic and docker constants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38688

Differential Revision: D21634022

Pulled By: kostmo

fbshipit-source-id: 8a47af7fb08fb77e8f2ea376b564a84abca1ad50
2020-05-18 19:58:20 -07:00
f184ec819d Do not use "buffer" in reentrant autograd err msg (#38625)
Summary:
`buffer` is also used to refer to `nn.Module`'s buffer. Wording is changed to reduce confusion between the two.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38625

Differential Revision: D21629396

Pulled By: albanD

fbshipit-source-id: acb5ef598739efabae7b388e1a4806c9caf0f589
2020-05-18 19:31:21 -07:00
958313a79f Fix memory usage increase reported in #38568 (#38674)
Summary:
Update to the in-place version of bias add in convolution; this saves an unnecessary memory allocation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38674

Differential Revision: D21626080

Pulled By: ngimel

fbshipit-source-id: 4f52a3ae2e5aefae372d8ea5188336216f910da3
2020-05-18 18:46:19 -07:00
f3048609d3 [CUDA] torch.roll for complex dtypes (#38664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38664

Test Plan: Imported from OSS

Differential Revision: D21630498

Pulled By: anjali411

fbshipit-source-id: bf43a812f3d8dd984785256bad41131410435965
2020-05-18 18:19:22 -07:00
724b2b6ebd Profiler: Call populate_cpu_children inside __str__ and fix typo (#37816)
Summary:
Fix https://github.com/pytorch/pytorch/issues/37500

I messed up with the old PR https://github.com/pytorch/pytorch/pull/37755 during rebasing and thus opened this one.

- Add call to `populate_cpu_children` for `__str__` to make sure that the printed result is correctly populated.
- Add test `test_profiler_aggregation_table`
- Fix a minor typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37816

Reviewed By: ilia-cher

Differential Revision: D21627502

Pulled By: ngimel

fbshipit-source-id: 9c908986b6a979ff08c2ad7e6f4afac1f5fbeebb
2020-05-18 16:47:13 -07:00
49d687f23c [JIT][to_backend] Move code that is not related to the user-facing API out of jit/backends/backend.h (#38567)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38431

**Test Plan**
```
python test/test_jit.py TestBackends
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38567

Test Plan:
```
python test/test_jit.py TestBackends
```

Differential Revision: D21598950

Pulled By: jansel

fbshipit-source-id: 794436cf351f28ded9c3e13fbcf173aee6c33d42
2020-05-18 16:30:34 -07:00
76fc9bd2ef Docker constants refactor (#38676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38676

Differential Revision: D21629431

Pulled By: malfet

fbshipit-source-id: 17503c80ceb3aa30a45c29742e719226658e5ca5
2020-05-18 16:01:54 -07:00
2f21dfb541 [TensorExpr] Eager reduction initialization & removal from ReduceOp (#38585)
Summary:
This PR removes the deferred initializer field from ReduceOp in favour of eagerly initializing buffers when they are created (either in the constructor of `LoopNest`, or in `rfactor()`). This allows a pretty good simplification of reduction logic, removing almost all of the reduction expander and the ReduceInitCleaner & unpopular NoOp node added in the last fix.

Eager initialization is better for us anyway because it allows more opportunities to transform the initialization loop.

Added a few more tests; testReduceOverSplitWithTail failed before this change due to a bug in splitWithTail that now can't happen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38585

Differential Revision: D21621551

Pulled By: nickgg

fbshipit-source-id: 378137e5723b4a6d6e390239efb12adce22a8215
2020-05-18 15:56:43 -07:00
97abed7cbe [quant] Remove TensorListObserver (#38584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38584

All observers will support tensor lists in a future PR

Test Plan: Imported from OSS

Differential Revision: D21623464

fbshipit-source-id: c5c57ecfe14f7c3aa92b7c99d724e846132ae03b
2020-05-18 15:49:34 -07:00
c430b7d80f Updating submodules
Summary:
GitHub commits:

d67d568565
6472821406
10a3d74cc4
10957464df
b2ab01c578
06943d59da
218b463647

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: af0d8f75f6ee6ac9e67812fcd6206070f74f3f49
2020-05-18 15:01:46 -07:00
8c07a98adc Error out of default_collate for lists of unequal size (#38492)
Summary:
Fix issue https://github.com/pytorch/pytorch/issues/23141

In the below example ```default_collate``` collates each element of the list. Since the second element isn't present in all samples, it is discarded:
```
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import numpy as np

class CustomDataset(Dataset):
    def __len__(self):
        return 2

    def __getitem__(self, idx):
        tmp = {
            "foo": np.array([1, 2, 3]),
            "bar": ["X"] * (idx+1),
        }

        return tmp

training = CustomDataset()

for batch in DataLoader(training, batch_size=2):
    print(batch)
```
Yields
```
{
  'foo': tensor(
    [
      [1, 2, 3],
      [1, 2, 3]
    ]
  ),
  'bar': [
      ('X', 'X'),
    ]
}
```

Based on discussion in the issue, it seems the best course of action is to error out in this case. This seems consistent with what is done for tensor elements, as seen in [TensorShape.cpp line 1066](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorShape.cpp#L1060) which is called when ```torch.stack``` is called. In this PR, I introduce a similar message to error out for lists.

SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38492

Differential Revision: D21620396

Pulled By: ezyang

fbshipit-source-id: 17f59fbb1ed1f0d9b2185c95b9ebe55ece701b0c
2020-05-18 14:53:33 -07:00
d7e08b456d FakeLowP Readme update (#38666)
Summary:
* fixed a typo source venv3/bin/active => source venv3/bin/activate
* Added instructions to set libomp.so path in LD_LIBRARY_PATH
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38666

Reviewed By: amylittleyang

Differential Revision: D21622582

Pulled By: yinghai

fbshipit-source-id: a286b1a25fea7de8b692bfba19e60978fcb3c215
2020-05-18 14:45:08 -07:00
378956b481 Make find_first_set work on x86 MSVC (#38637)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38322#issuecomment-630031072.
Tested locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38637

Differential Revision: D21620059

Pulled By: ezyang

fbshipit-source-id: 50af50ce29e46759f11a196fa0fedca2740214bb
2020-05-18 14:40:10 -07:00
23207ae656 towards guard what you use (#38576)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38576

Reviewed By: eellison

Differential Revision: D21608008

Pulled By: Krovatkin

fbshipit-source-id: c5f9783b6a7cdefe932d65502874cfc4fa650e3c
2020-05-18 14:28:03 -07:00
5fcb2f678f [Distributed Autograd] Make debugInfoMap from strings to ints (#38416)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38416

This diff primarily changes the `debugInfoMap` to map from strings to ints, instead of strings to strings. We were basically just converting these back to ints in Python, so this avoids the extra conversions.

 `arc lint` also exposed tons of linting issues so fixing those here as well.

Test Plan: Build Bot - the tests already check whether the debugInfoMap is correct.

Differential Revision: D21266522

fbshipit-source-id: e742dec272bb1bab1bee01542610802922abab6b
2020-05-18 14:19:13 -07:00
5e2d8745c8 RIP CUDA <9.2: circleci, aten, and caffe2 (#36846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36846

Test Plan: Imported from OSS

Differential Revision: D21620850

Pulled By: ngimel

fbshipit-source-id: 7ad1676a12f86250f301095ffc6f365a3b370f34
2020-05-18 13:41:05 -07:00
b29e7f9b9d [TensorExpr] Use couldMoveBefore instead of couldMoveAfter checks in the fuser pass, add CPP tests. (#38592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38592

I'm not sure that using couldMoveAfter was incorrect, but using
couldMoveBefore is more consistent with other subgraph-extraction
passes (old fuser, create autodiff graphs, etc.), so it would make it
easier to unify their implementations after this change.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D21607856

Pulled By: ZolotukhinM

fbshipit-source-id: 970583af7859889d48aacf620ae028258e37a75f
2020-05-18 13:40:59 -07:00
895479e612 Complete codegen of 'build' workflow YAML tree (#38631)
Summary:
This replaces all "verbatim-sources" files comprising the workflow named 'build' in the CircleCI config with code generation.  This shall facilitate an automated conversion to workflow-per-job.

Note that the '.circleci/config.yml' file has some strictly cosmetic changes in this PR: some keys are sorted and inline comments are removed (moved to the Python modules that generate the config).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38631

Differential Revision: D21623528

Pulled By: kostmo

fbshipit-source-id: d86bd7aea979f443db14b4a3898220faad6bd0da
2020-05-18 13:38:53 -07:00
262f70c986 [PyTorch] Remove module and operator observer macros. (#38489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38489

Remove module and operator observer macros.
ghstack-source-id: 104290763

Test Plan:
a. Verify that QPL is being sent while testing FB4A BI Cloaking:

{F236982877}

b. Verify that AI Benchmark is working on both module and operator level:
https://our.intern.facebook.com/intern/aibench/details/808056762618979

c. Verify the macOS segmentation effect by running `buck run xplat/arfx/tracking/segmentation/tools:person_segmentation_demoAppleMac#macosx-x86_64`:

{F236982853}

Reviewed By: ljk53

Differential Revision: D21540838

fbshipit-source-id: 516f84ef5673d4ceed38ae152440a5cbacc6ddaa
2020-05-18 13:28:01 -07:00
5b12c29b17 [ONNX]Update Dropout Export (#37641)
Summary:
The Dropout operator in ONNX has an additional input: training_mode.
This updates the Dropout export to match the changes made in ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37641

Reviewed By: hl475

Differential Revision: D21613782

Pulled By: houseroad

fbshipit-source-id: f34d1a1f8116200c6609b4b43489d5610f6d0ec4
2020-05-18 13:10:44 -07:00
59d92e442b Vectorize non-persistent Softmax (#38557)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/36485 with bug fix & enhanced testing.

Moved `test_softmax_backward` -> `test_softmax_results`, check fprop & bgrad against CPU implementation for all cases.

\cc ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38557

Differential Revision: D21620805

Pulled By: ngimel

fbshipit-source-id: 4f736b3e59f79142e1b982eb643c592dedcbe111
2020-05-18 13:05:36 -07:00
b2c06ad875 [JIT] Export all jit/backend headers in BUILD.bazel (#38668)
Summary:
**Summary**
This commit modifies `BUILD.bazel` to include all headers in
`jit/backends` in `torch_headers` so that they can be accessed
by external backend code that lives in a different repository.

**Test Plan**
Continuous integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38668

Differential Revision: D21623755

Pulled By: SplitInfinity

fbshipit-source-id: 7f77b70e056205444e5ae63b47d87d8791131c3c
2020-05-18 12:32:00 -07:00
eb224721d2 Enabled dropout removal pass in mobile optimizer. (#38254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38254

ghstack-source-id: 103939143

Test Plan:
mobile optimizer test.
Also tested on pytext model.

Reviewed By: dreiss

Differential Revision: D21505862

fbshipit-source-id: 95c3b205e85cc7c4f20f5416c09cd0bc862849ce
2020-05-18 12:21:26 -07:00
09c430a2aa support complex types for tanh_backward_cpu (#37791)
Summary:
Closes: https://github.com/pytorch/pytorch/issues/37701

TO-DO:
* [x] Add Tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37791

Differential Revision: D21619827

Pulled By: anjali411

fbshipit-source-id: 0919ec80168a7f8b8092da8d39b8bc6f519d3440
2020-05-18 12:09:56 -07:00
a84fd8de39 Handling Active Call Count through Future Callback (#38589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38589

This PR creates a unified way of decrementing the active call count on the client side by attaching a callback to the future returned by `TensorPipeAgent::send`.
ghstack-source-id: 104227074
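
Conceptually, the pattern looks like this (a Python sketch using the public `torch.futures` API; the real code is C++ inside the agent, and the names here are illustrative):
```python
import threading
import torch

active_calls = 0
lock = threading.Lock()

def track_send(fut: torch.futures.Future) -> torch.futures.Future:
    global active_calls
    with lock:
        active_calls += 1

    def on_done(completed):
        global active_calls
        with lock:
            active_calls -= 1   # runs on success and on error alike
        return completed.wait()

    return fut.then(on_done)
```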

Test Plan: CI/Sandcastle once tests PR's are merged.

Differential Revision: D21605779

fbshipit-source-id: c82396de6984876b09ee032ab1aa0f68a87005be
2020-05-18 11:59:38 -07:00
34ef473d92 [Tensorpipe Agent] Timeouts for RPC requests (#38448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38448

This PR implements timeout support for RPCs, and respects the new per-RPC timeout functionality.

A map containing RPC futures, keyed by an expiration time, is populated by the send function for each RPC.

A separate watchdog thread polls this map and sets errors on all incomplete futures.
Note: we cannot set errors on a future with the lock held (doing so triggers callbacks immediately and, if one of the callback functions tries to acquire the lock we held when setting the error, we get a lock-order cycle). Thus we add all incomplete futures to a list, and then iterate through the list outside the lock to set errors on those futures if necessary.
ghstack-source-id: 104227075
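
A rough sketch of that watchdog pattern (Python, with `concurrent.futures.Future` standing in for the agent's C++ futures):
```python
import threading
import time
from concurrent.futures import Future

lock = threading.Lock()
futures_by_deadline = {}  # expiration time -> list of Futures

def watchdog_loop():
    while True:
        expired = []
        now = time.monotonic()
        with lock:
            for deadline in sorted(futures_by_deadline):
                if deadline > now:
                    break  # scanned in deadline order
                expired.extend(futures_by_deadline.pop(deadline))
        # complete the futures only after dropping the lock: completing a
        # future fires its callbacks, and a callback that re-acquires the
        # lock would otherwise create a lock-order cycle
        for fut in expired:
            if not fut.done():
                fut.set_exception(TimeoutError("RPC timed out"))
        time.sleep(0.1)
```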

Test Plan: Will patch the testing diff on top of this to run tests.

Differential Revision: D21468526

fbshipit-source-id: 4514484ece6fb6be673427d44c7f3164ab3d9d7c
2020-05-18 11:59:32 -07:00
ece878e5b8 [ONNX] Add GreaterOrEqual and LessOrEqual to opset 12 ONNX export (#38311)
Summary:
GreaterOrEqual and LessOrEqual were added in opset 12; this PR adds support for exporting these operators to ONNX directly instead of composing them from "not" with "less than" or "greater than".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38311

Reviewed By: hl475

Differential Revision: D21613795

Pulled By: houseroad

fbshipit-source-id: 121d936d9787876ecb19cf24d661261e4abc82ab
2020-05-18 11:59:26 -07:00
ca05fb2e86 Add autograd tests for complex (#38658)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38658

Test Plan: Imported from OSS

Differential Revision: D21622580

Pulled By: anjali411

fbshipit-source-id: 977f4a3074b09f72dd8bb6895edd3b6d3152b04b
2020-05-18 11:57:54 -07:00
44cead3a31 Improve syncbn doc format (#38423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38423

Differential Revision: D21601342

Pulled By: jerryzh168

fbshipit-source-id: dd2bf012831025495e9ece3db08536dd1d515645
2020-05-18 11:52:07 -07:00
59bef16138 Add ci binary test for windows (#38297)
Summary:
Tested in https://github.com/pytorch/pytorch/pull/38316.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38297

Differential Revision: D21623307

Pulled By: seemethere

fbshipit-source-id: 51f89f826f4add72d1f18456cad175a4f1724010
2020-05-18 11:50:17 -07:00
d904f3324f [NNPI] Support fp32 bias in NNPI Backend (#38596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38596

ATT.

Test Plan:
unittests in the diff
```
buck test mode/dev //glow/fb/test/numerics:test_fc_nnpi_int8nnpi -- 'test_int8_fc_simple_fp32_bias \(glow\.fb\.test\.numerics\.test_fc_nnpi_int8\.Int8FCTest\)'
```

Reviewed By: jackm321

Differential Revision: D20474831

fbshipit-source-id: 9c49a71eb1926466013a196a3d6e60cdb25cf721
2020-05-18 11:45:41 -07:00
67cd263876 Fix merge_fp32_inputs_into_fp16 with no partition (#38594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38594

By default, we don't have a partition name, so the previous implementation would fail to rewire the input into the split-convert output. It's usually a hidden perf issue rather than a correctness issue.

Test Plan:
Enhanced
```
buck test glow/fb/test:test_merge_inputs_nnpi_fp16nnpi
```

Reviewed By: tracelogfb

Differential Revision: D21608439

fbshipit-source-id: d72b06500a3b84f6747aa77cf9fd8754a4ff1195
2020-05-18 11:45:35 -07:00
8338426ed8 Fix infinite loop bug in minimizer (#38507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38507

With `--merge_fp32_inputs_into_fp16` we added some ops to the net without net_pos, which makes the cardinality of the blacklist pos smaller than the number of ops in the net. Previously, the updateInternalState() function of the minimizer would just enter an infinite loop. This diff fixes it by changing the loop condition.

Reviewed By: tracelogfb

Differential Revision: D21578777

fbshipit-source-id: 0d5373fa0a417ded1c80a2dc03248c07b1e0a320
2020-05-18 11:44:05 -07:00
e6993938de Avoid Releasing, Reacquiring lock per iteration in RPC Retry Thread (#38521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38521

In the RPC Retry Thread, we add retriable futures to a list under the lock, release the lock, add callbacks/set errors on those futures, then re-acquire the lock to clean up the retry map. We can simply clean up the retry map before releasing the lock and not acquire it again - this is cleaner and may result in better perf if it reduces context switching between threads looking to acquire the retryMapLock.
ghstack-source-id: 104062147

Test Plan: CI, there are thorough tests in the RPC framework to test errors with retries.

Differential Revision: D21563085

fbshipit-source-id: 35e620892da630d082c032f5f9ce16e8a9ffdfaa
2020-05-18 10:59:13 -07:00
711f258dc7 Enable tests in test_pytorch_onnx_onnxruntime (#37868)
Summary:
Enable tests in tests/onnx/test_pytorch_onnx_onnxruntime.py for:
- Einsum
- SoftmaxCrossEntropy
- NLLLoss
- normalize
- pixel_shuffle
- test_interpolate_no_shape
- test_arange_dynamic
- test_slice_neg_large_negone
since there is support in ORT for these operators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37868

Reviewed By: hl475

Differential Revision: D21440528

Pulled By: houseroad

fbshipit-source-id: 4e590c554d000981bb12d4ce3ff4c175ed73a274
2020-05-18 10:50:03 -07:00
a86176dee2 CTC target un-pad example (#38393)
Summary:
CTC target un-pad example
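The gist of the example (a hedged sketch; tensor values and names are illustrative):
```python
import torch

# batch of label sequences, zero-padded to a common length
targets = torch.tensor([[1, 2, 3, 0, 0],
                        [4, 5, 0, 0, 0]])
target_lengths = torch.tensor([3, 2])

# CTCLoss also accepts a flat 1-D target tensor; strip the padding
flat_targets = torch.cat([t[:int(l)] for t, l in zip(targets, target_lengths)])
print(flat_targets)  # tensor([1, 2, 3, 4, 5])
```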
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38393

Differential Revision: D21620042

Pulled By: ezyang

fbshipit-source-id: 532d77c92fe1742c6f6b4a1b61c281f042cf8374
2020-05-18 10:31:14 -07:00
8292742ba0 fake_quant: move observer and fake_quant flags into buffers (#38368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38368

There is a need for some customers to enable/disable these flags
in the middle of QAT.  To make it work properly with DDP,
we need to implement them using buffers so that they are replicated
properly to all the nodes.

This should solve issue https://github.com/pytorch/pytorch/issues/38081
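
With buffer-backed flags, mid-training toggling works the usual eager-mode way and the state replicates under DDP; a minimal sketch (assuming the eager-mode QAT helpers):
```python
import torch
import torch.quantization as tq

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
model = tq.prepare_qat(model.train())

# ... some QAT epochs ...
# midway through QAT: freeze observer statistics, keep fake-quant active
model.apply(tq.disable_observer)
model.apply(tq.enable_fake_quant)
```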

Test Plan:
CI

Imported from OSS

Differential Revision: D21537607

fbshipit-source-id: 8c9da022beb7aaa44c658268f02f99dd5aee93fd
2020-05-18 09:30:07 -07:00
b27be3e0c5 Avoid double dispatch in logical_not for compilation speed reasons. (#38565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38565

Also note this turns on "-Wno-unused-local-typedefs" because we are using dispatch macros for error checking.

Test Plan: Imported from OSS

Differential Revision: D21598478

Pulled By: gchanan

fbshipit-source-id: 28f9ad01bd678df0601a10d0daf3ed31c47c4ab2
2020-05-18 09:25:54 -07:00
176174a68b Remove BC hack (#38571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38571

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D21600325

Pulled By: jamesr66a

fbshipit-source-id: 9c1d53c2271702ad21cab07aafbd3bb16b474308
2020-05-16 19:51:42 -07:00
873f9025bb Updating submodules
Summary:
GitHub commits:

c160c61bde
09f8ebd98a
c881ebf415

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 986beb532d39dab1368e0099be84e1fc69072d46
2020-05-16 19:50:04 -07:00
fe44741dba Updating submodules
Summary:
GitHub commits:

2b1888a631
9f5ac27f50

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 1e17af89ad40dfeaf6ca662fb5cdfc89d1b76a99
2020-05-16 02:23:41 -07:00
83df3beaca Add complex support for torch.sum (#38382)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38382

Test Plan: Imported from OSS

Differential Revision: D21600127

Pulled By: anjali411

fbshipit-source-id: c5338ab10bdcebe4a281b03f78e6f2063186bc32
2020-05-15 19:49:38 -07:00
db86c8c6f5 Test BC for built-in torchbind methods (#38560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38560

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D21598067

Pulled By: jamesr66a

fbshipit-source-id: 26a0e92a5c2883326be261cf84b7e916ebfd60d8
2020-05-15 19:06:59 -07:00
b9c537514c [JIT] Remove import statement thing in serialization docs (#38578)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38578

Test Plan: Imported from OSS

Differential Revision: D21603383

Pulled By: jamesr66a

fbshipit-source-id: 07c7eb62f048406f2e21528e32c677d18eb87cce
2020-05-15 18:26:36 -07:00
feb24577c2 Reduce number of variables in codegen (#38369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38369
It seems we have a lot of variables in codegen that carry duplicate information.
This PR removes them and unifies all use sites to use the same instance.

ghstack-source-id: 104067031

Test Plan: waitforsandcastle

Differential Revision: D21537983

fbshipit-source-id: 8d3ce3d3f712f7ba355e8c192798dfefaf847dac
2020-05-15 17:58:45 -07:00
31b57e38cb [jit] fix index_put_ error in subscript assignment (#38378)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/27493
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38378

Test Plan: `pytest ./test/test_jit.py -k test_tensor_subscript_assign`

Differential Revision: D21540489

Pulled By: jansel

fbshipit-source-id: a06e55175942b9d51ccc51d5440b7b122481b333
2020-05-15 17:53:27 -07:00
f39222a13d Restore thread_local states in continuation thread on RPC servers (#38512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38512

As we gradually making the RPC non-blocking on server side, the
processing of the same request can yield-run on different threads.
Hence, we need to populate thread_local states (e.g., ctx id) in
the continuation thread.

Fixes #38439

Test Plan: Imported from OSS

Differential Revision: D21583642

Pulled By: mrshenli

fbshipit-source-id: a79bce1cb207fd11f1fa02b08465e49badda65fc
2020-05-15 17:23:04 -07:00
8752d6a736 DOC: Correct upsample doc to match interpolation (#38455)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38334 and correct the docs of `torch.nn.functional.upsample`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38455

Differential Revision: D21583515

Pulled By: driazati

fbshipit-source-id: 6ac5a79ba489bdcdd3fab34e4eddb4864e20a29e
2020-05-15 17:09:26 -07:00
8743d51182 Updating submodules
Summary:
GitHub commits:

b9f1a803de

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 131edb2b9b26117e2b9a9d34a91305251b04e7e2
2020-05-15 17:04:21 -07:00
0d9bb5f580 [JIT] Use GE optimizer guard in import (#38575)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38575

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D21603086

Pulled By: jamesr66a

fbshipit-source-id: 4427efabfdcc449045485e1cd9c2740ea823cc9c
2020-05-15 16:57:42 -07:00
67d76f6bdd Add utility to enable cpp stacktraces in torch.utils.debug (#38127)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38127

Test Plan: Imported from OSS

Differential Revision: D21595298

Pulled By: albanD

fbshipit-source-id: 3926336cea2eaa0ef50bf9bfffd6c07f239d753f
2020-05-15 16:49:16 -07:00
87f40fef84 Refactor check macros to reuse code (#38126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38126

Test Plan: Imported from OSS

Differential Revision: D21595300

Pulled By: albanD

fbshipit-source-id: 53805053a7a1ad35e93f335e889718e699a5dce1
2020-05-15 16:49:03 -07:00
adf67b81c5 Make strip error messages work for cuda code (#38125)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38125

Test Plan: Imported from OSS

Differential Revision: D21595299

Pulled By: albanD

fbshipit-source-id: 1eec45d32afbd3d09d71d7cb155b8e69f4ba496b
2020-05-15 16:47:15 -07:00
9cfc10d52e Updates assertEqual to use torch.isclose-like logic (#37294)
Summary:
Edit: this has been updated to reflect the PR's current status, which has changed after review.

This PR updates the behavior of the assertEqual, assertNotEqual, and assert_allclose to be consistent with each other and torch.isclose. It corrects several additional bugs in the current implementations and adds extensive testing and comments, too.

These updates follow from changes to assertEqual like https://github.com/pytorch/pytorch/pull/34258 and https://github.com/pytorch/pytorch/pull/37069, and from our discussion of torch.isclose for complex tensors (see https://github.com/pytorch/pytorch/issues/36462), where we decided to implement a NumPy-compatible mathematical notion of "closeness" for complex tensors that is not a great fit for our testing framework.

The detailed changelist is:

- New test framework functions for comparing tensors and scalars
  - Tensors are compared using isclose; the real and imaginary parts of complex tensors are compared independently (see the sketch after this list)
  - Scalars are compared using the same algorithm
  - assertEqual and assert_allclose now use this common comparison function, instead of each implementing their own with divergent behavior
  - assertEqual-like debug messages are now available for all tensor and scalar comparisons, with additional context when comparing the components of sparse, quantized, and complex tensors
- Extensive testing of the comparison behavior and debug messages
- Small Updates
  - assertEqual now takes an "exact_device" argument, analogous to "exact_dtype", which should be useful in multidevice tests
  - assertEqual now takes an "equal_nan" argument for argument consistency with torch.isclose
  - assertEqual no longer takes the "allow_inf" keyword, which misleadingly only applied to scalar comparisons, was only ever set (rarely) to true, and is not supported by torch.isclose
- Bug fixes:
  - the exact_dtype attribute has been removed (no longer needed after https://github.com/pytorch/pytorch/pull/38103)
  - message arguments passed to assertEqual are now handled correctly
  - bool x other dtype comparisons are now supported
  - uint8 and int8 tensor comparisons now function properly
  - rtol for integer comparisons is now supported (default is zero)
  - rtol and atol for scalar comparisons are now supported
  - complex scalar comparisons are now supported, analogous to complex tensor comparisons
  - assertNotEqual is now equivalent to the logical negation of assertEqual
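
A minimal sketch of that comparison rule for complex tensors (an illustrative helper, not the actual test-framework code):
```python
import torch

def complex_close(a, b, rtol=1.3e-6, atol=1e-5):
    # compare real and imaginary components independently rather than
    # testing |a - b| against a tolerance on the complex values
    return (torch.isclose(a.real, b.real, rtol=rtol, atol=atol)
            & torch.isclose(a.imag, b.imag, rtol=rtol, atol=atol)).all()
```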
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37294

Differential Revision: D21596830

Pulled By: mruberry

fbshipit-source-id: f2576669f7113a06f82581fc71883e6b772de19b
2020-05-15 16:24:03 -07:00
6a23214a47 [JIT] Adjust pybind includes in backend.h (#38562)
Summary:
**Summary**
This commit adjusts the `pybind` includes in `backend.h` so
that we can avoid exporting some unrelated headers during install (which
probably shouldn't be exposed anyway). In addition, the headers that this commit
removes are not used.

**Test Plan**
Continuous integration (includes tests for JIT backends).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38562

Differential Revision: D21601694

Pulled By: SplitInfinity

fbshipit-source-id: c8f8103d24cb4f10d9eb6b3657eed75878078945
2020-05-15 16:01:22 -07:00
b04c07a67c Added a Resource section to README (#38547)
Summary:
Added the following entries in the newly made resources section in README:

* [PyTorch.org](https://pytorch.org/)
* [PyTorch Tutorials](https://pytorch.org/tutorials/)
* [PyTorch Examples](https://github.com/pytorch/examples)
* [PyTorch Models](https://pytorch.org/hub/)
* [Intro to Deep Learning with PyTorch from Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188)
* [Intro to Machine Learning with PyTorch from Udacity](https://www.udacity.com/course/intro-to-machine-learning-nanodegree--nd229)
* [Deep Neural Networks with PyTorch from Coursera](https://www.coursera.org/learn/deep-neural-networks-with-pytorch)
* [PyTorch Twitter](https://twitter.com/PyTorch)
* [PyTorch Blog](https://pytorch.org/blog/)
* [PyTorch YouTube](https://www.youtube.com/channel/UCWXI5YeOsh03QvJ59PMaXFw)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38547

Differential Revision: D21601647

Pulled By: jerryzh168

fbshipit-source-id: 2453312401386aa59c3b6c62b9f735dc8eb4947f
2020-05-15 15:54:10 -07:00
53a368fedd [aten] Split some at::launch code into at::internal::launch_no_thread_state() (#38477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38477

A few specific uses (e.g. Thrift rpc parsing) don't need source thread
state to be copied over. In microbenchmarks, copying it seems to add ~500ns,
so the code is split across functions so that such callers can use the no-thread-state version directly.
ghstack-source-id: 104190095

Test Plan:
- Existing code using at::launch exercises this codepath, so buck test mode/dev-nosan caffe2/test/...
 - For the split version, primarily the Thrift-based change layered on top of this.

Differential Revision: D21573168

fbshipit-source-id: 2ef1f196b5177634d4ee7fdca7371d36906a69d6
2020-05-15 15:06:23 -07:00
6232481cab [quant][graphmode] Add RemoveRedundantDequantize pass (#38434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38434

We insert dequantize for each use in order to produce quantization patterns that will
later be fused; after that we should also remove the extra dequantize nodes produced by this operation.

Test Plan: Imported from OSS

Differential Revision: D21597834

fbshipit-source-id: 18dfb2760bbb08932aa4e1d06f96cfc5fb37ed88
2020-05-15 15:01:40 -07:00
dd7eed5ae4 [JIT] Export JIT backend extension headers in setup.py (#38525)
Summary:
**Summary**
This commit adds the headers required to define and use JIT backends to
`package_data` in `setup.py` so that they are exported and copied to the
same place as the rest of the headers when PyTorch is installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38525

Differential Revision: D21601806

Pulled By: SplitInfinity

fbshipit-source-id: 1615dd4047777926e013d7dd14fe427d5ffb8b70
2020-05-15 14:45:08 -07:00
1d1533e358 Migrate CPU cross and some elementwise to c10::complex (#38023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38023

Test Plan: Imported from OSS

Differential Revision: D21518304

Pulled By: anjali411

fbshipit-source-id: a7d7bad7ba0af66314a5e91608add32d36695e6b
2020-05-15 14:33:41 -07:00
dc918162b7 Remove Caffe2_MAIN_LIBS (#38408)
Summary:
Right now it is an unused alias to `torch_library` interface library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38408

Differential Revision: D21598250

Pulled By: malfet

fbshipit-source-id: ec9a2446b94e7ea68298831212005c2c80bbc95c
2020-05-15 12:27:15 -07:00
daa85cfe2e [JIT] Exit Transform Rewrite (#38282)
Summary:
After an early return, we conditionalize all further execution. This means that currently the pattern of
`if return elif return elif return` generates better code than `if return if return if return`. It's obviously not good to have semantically equivalent code generate worse IR, so we should rewrite the graph to handle this case. This came up in https://github.com/pytorch/pytorch/pull/37171

```
torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    return 2
print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  _0 = uninitialized(int)
  if x:
    _1, _2 = True, 1
  else:
    _1, _2 = False, _0
  if _1:
    _3 = _2
  else:
    _3 = 2
  return _3
```
while
```
torch.jit.script
def test_foo(x: bool, y: bool):
    if x:
        return 1
    else:
        return 2
print(test_foo.code)
```
generates:
```
def test_foo(x: bool,
    y: bool) -> int:
  if x:
    _0 = 1
  else:
    _0 = 2
  return _0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38282

Differential Revision: D21576733

Pulled By: eellison

fbshipit-source-id: 80cf1ad7fbda6d8d58557abbfb21c90eafae7488
2020-05-15 12:22:28 -07:00
62afc2d63d [JIT] Remove debug print statement added in #37994 (#38524)
Summary:
**Summary**
This commit removes a print statement added in https://github.com/pytorch/pytorch/issues/37994 that appears to
be for debugging and was most likely not intended to be committed.

**Test Plan**
Continuous integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38524

Differential Revision: D21587268

Pulled By: SplitInfinity

fbshipit-source-id: 6bdcdce647c45f5c0a2ba179a3545a1c0cae1492
2020-05-15 12:01:34 -07:00
d44573a6dc Remove _all_weight_values again (#38504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38504

Test Plan: Imported from OSS

Differential Revision: D21579530

Pulled By: jamesr66a

fbshipit-source-id: 4449c92142200eaadc68b59d6f5f964ba60b1c80
2020-05-15 11:55:09 -07:00
8d7582a2cf codegen mobile and macos configs (#38539)
Summary:
Second round of conversion to generated CircleCI configs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38539

Differential Revision: D21596722

Pulled By: kostmo

fbshipit-source-id: 448eca382f7f108d1c8b45df419429423c3b248f
2020-05-15 10:56:06 -07:00
70ef9f5124 Improve testing of logical_not. (#38505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38505

This takes the testing of https://github.com/pytorch/pytorch/pull/38275, but doesn't include the kernel changes which are still being worked out.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D21580574

Pulled By: gchanan

fbshipit-source-id: f12317259cb7373989f6c9ad345b19aaac524851
2020-05-15 10:51:35 -07:00
42a3fb3a4e change to find_method of lite_interpreter API to return nullptr if method not found (#38503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38503

Modify find_method to not error out if the method doesn't
exist, to be more similar to:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/object.h#L100

Test Plan: run functions as part of mobile ASR

Reviewed By: iseeyuan

Differential Revision: D21466638

fbshipit-source-id: 635bff32539f7495f68dd3b203aaeb108f6283da
2020-05-15 10:33:16 -07:00
5a19fe7454 migrate gather to ATen (CUDA) (#37659)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24567](https://github.com/pytorch/pytorch/issues/24567).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37659

Differential Revision: D21504432

Pulled By: ngimel

fbshipit-source-id: baeb464a511236b01b69a7dddd4a3db268cd799c
2020-05-15 10:26:59 -07:00
52e9953faf use version number instead of 'master' in html header title (#38149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38149

This is for (#21290) (#31894)

Instead of putting "Pytorch master documentation" in header's html title, now we use "Pytorch 1.x.x documentation", this is similar to tensorFlow and numpy doc page.

In Google search, we will get "Pytorch Documentation - Pytorch 1.x.x Documentation" instead.

Test Plan: Imported from OSS

Differential Revision: D21586559

Pulled By: glaringlee

fbshipit-source-id: 2995709ac3c22dbb0183b5b4abfde7d795f1f8eb
2020-05-15 08:32:32 -07:00
4b52e52577 Use jit_core_sources from build_variables.bzl (#38526)
Summary:
Replace hardcoded filelist in aten/src/ATen/CMakeLists.txt with one from `jit_source_sources`
Fix `append_filelist` to work independently of the location from which it was invoked
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38526

Differential Revision: D21594582

Pulled By: malfet

fbshipit-source-id: c7f216a460edd474a6258ba5ddafd4c4f59b02be
2020-05-15 08:21:37 -07:00
242af6c078 Add tan_cuda for complex dtypes (#38400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38400

* #38399 Added autograd tests, disabled jit autograd tests for complex and added a separate list for tests for complex dtype only

Test Plan: Imported from OSS

Differential Revision: D21572209

Pulled By: anjali411

fbshipit-source-id: 7036029e9f8336139f5d54e0dfff9759f3bf8376
2020-05-15 08:16:59 -07:00
acacad2575 Adding support for manifold files in DBReader (#37727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37727

Check if the file exists locally only for `log_file_db` db_type. Reader files in other `db_type` like `manifold_log_file_db` are excluded from this check.

Test Plan: Verified that files stored in manifold can be loaded using `DBFileReader`.

Reviewed By: hbjerry

Differential Revision: D21329671

fbshipit-source-id: bbc0e88851783ca3f78f7c61bfe84b480c09b5ac
2020-05-15 07:18:30 -07:00
bae895cef0 Issue 37819: Added check for kHIP in ATen/native/Copy.cpp (#38003)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/37819
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38003

Differential Revision: D21533134

Pulled By: mruberry

fbshipit-source-id: 97490a8729171b95b103e00780e36518b9865087
2020-05-15 01:40:48 -07:00
bf2bbd9648 Add message to static_assert (#38519)
Summary:
Per the standard (https://en.cppreference.com/w/cpp/language/static_assert), static_assert without a message is not supported in C++14. Some compilers complained about this.

cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38519

Differential Revision: D21589194

Pulled By: ezyang

fbshipit-source-id: d01555b2861703f0326a99bc5162e124695b0624
2020-05-15 00:26:49 -07:00
c0bc182761 Revert "Vectorize non-persistent Softmax kernels (#36485)" (#38534)
Summary:
This reverts commit c879c6fb98ec38197c86c703e1011c8b94f14c59.
(it produces incorrect results)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38534

Reviewed By: soumith

Differential Revision: D21589251

Pulled By: ngimel

fbshipit-source-id: 66d5324848d0245d15b7ef5f1fe4302ed0992b56
2020-05-14 23:17:59 -07:00
8bf3124572 [TensorExpr] Fix bug when splitting inner reduce axis with tail (#38420)
Summary:
Fixes a bug in the following code:
```
    Tensor* c = Reduce("sum", {{10, "m"}}, Sum(), b, {{10, "n"}, {10, "k"}});
    // split N loop with tail:
    loop.splitWithTail(loop.getLoopStmtsFor(c)[1], 8, &outer, &inner, &tail);
```

When this is expanded there are two ReduceOps:

```
for (int m = 0; m < 10; m++) {
  for (int n_outer = 0; n_outer < (10 - 0) / 8; n_outer++) {
    for (int n_inner = 0; n_inner < 8; n_inner++) {
      for (int k = 0; k < 10; k++) {
        sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_outer * 8 + n_inner, k]), out_args={m}, reduce_args={n_inner, n_outer, k});
      }
    }
  }
  for (int n_tail = 0; n_tail < (10 - 0) % 8; n_tail++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_tail + ((10 - 0) / 8) * 8, k]), out_args={m}, reduce_args={n_tail, k});
    }
  }
}
```

But each ReduceOp will expand it's initializer, which in this case will overwrite the sum of the split loop:

```
for (int m = 0; m < 10; m++) {
  sum[m] = 0.f;
  for (int n_inner = 0; n_inner < 8; n_inner++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = (sum[m]) + (b[(100 * m + k) + 10 * n_inner]);
    }
  }
  sum[m] = 0.f;          <------- *HERE*
  for (int n_tail = 0; n_tail < 2; n_tail++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = (sum[m]) + (b[((100 * m + k) + 10 * n_tail) + 80]);
    }
  }
}
```

The simplest fix is to remove the initializer from the tail loop, which requires adding support for Reductions without an initializer (I did this by adding a NoOp Expr rather than handling nullptr). Also moved the ReductionExpander from loopnest.cpp to reduction.h, as loopnest is getting a bit heavy.

Added tests for all kinds of splits on a simple 3D reduction to verify no more problems of this type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38420

Differential Revision: D21587583

Pulled By: nickgg

fbshipit-source-id: e0766934481917007119612eb60cc76c3242e44a
2020-05-14 22:58:28 -07:00
0d51728d38 Updating submodules
Summary:
GitHub commits:

1b4b90a028
17b31be012

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 504a9afa39116184bd7d211698b406d935116715
2020-05-14 22:54:11 -07:00
3cb2778d94 Remove some unnecessary cast for complex numbers. (#38422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38422

This partially reverts #38021, due to the availability of #38418

Test Plan: Imported from OSS

Differential Revision: D21587201

Pulled By: malfet

fbshipit-source-id: c0717303c842ceb3a202986ec0e808ed45f682f1
2020-05-14 22:25:24 -07:00
000fea375c Support operations on c10::complex and integer scalars (#38418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38418

This is useful in reducing verbosity in c10::complex's general usage, and potentially also offers
performance benefits.

This brings back #34506 (which was made for std::complex).

Differential Revision: D21587012

Test Plan: Imported from OSS

Pulled By: malfet

fbshipit-source-id: 6dd10c2f417d6f6d0935c9e1d8b457fd29c163af
2020-05-14 22:23:14 -07:00
ac613371a3 Update NNPI backend to 0.5.2.5. (#4464)
Summary:
Update of NNPI Backend to v0.5.2.5.
Pull Request resolved: https://github.com/pytorch/glow/pull/4464

Reviewed By: arunm-git

Differential Revision: D21418023

Pulled By: hl475

fbshipit-source-id: 254fcbca28bce0cfc37672306db7f9a352423d18
2020-05-14 22:15:17 -07:00
ec9b2f9a9d [quant][graphmode][refactor] Factor out getFixedQParamOpFusionInfo (#38359)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38359

Test Plan: Imported from OSS

Differential Revision: D21559807

fbshipit-source-id: 13a67049a189ca43dcdae4b42bab0847821b3cd5
2020-05-14 21:37:59 -07:00
960f4b51e3 [JIT] Fix @staticmethod access from self on modules (#37702)
Summary:
Closes https://github.com/pytorch/pytorch/issues/30755
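For context, the now-working pattern looks roughly like this:
```python
import torch

class M(torch.nn.Module):
    @staticmethod
    def twice(x):
        return 2 * x

    def forward(self, x):
        # calling a staticmethod through self previously failed to script
        return self.twice(x)

m = torch.jit.script(M())
print(m(torch.ones(2)))  # tensor([2., 2.])
```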
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37702

Differential Revision: D21389989

Pulled By: voznesenskym

fbshipit-source-id: f9b7e26a9eab7dc3d7762a5a28f85424dac5fbb3
2020-05-14 21:12:10 -07:00
3d0532f3ab [c2] fix compute_norm test (#38529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38529

(Note: this ignores all push blocking failures!)

Test Plan: buck test mode/opt //caffe2/caffe2/python/modeling:compute_norm_for_blobs_test

Reviewed By: olittle

Differential Revision: D21588603

fbshipit-source-id: bdb0ae455e85a934cb5e369fbb0078f2ff842814
2020-05-14 20:49:36 -07:00
8df14c573e Add sccache support for hcc and hip-clang in ROCm (#38451)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38451

Differential Revision: D21589126

Pulled By: ezyang

fbshipit-source-id: dc4d08e7f393dbe369e501334c776071b2c176e0
2020-05-14 20:44:20 -07:00
fac9f36563 Back out "[c2] register cuda op for LpNorm (fallback)"
Summary: Original commit changeset: 573419e5a8da

Test Plan: D21562485  breaks CI build. Unlanding

Reviewed By: olittle

Differential Revision: D21588831

fbshipit-source-id: 6dda4b71904d7765f32f570f9722e4a9a6cbc97b
2020-05-14 20:25:30 -07:00
ee52501976 [quant][graphmode][refactor] Factor out getInputTensorQParamOpFusionInfo (#38358)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38358

Test Plan: Imported from OSS

Differential Revision: D21559806

fbshipit-source-id: b243b811c5c5917f50a11ef5b26174baf46e683f
2020-05-14 19:59:09 -07:00
155a287aea Enforce const on PyRRef functions (#38415)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38415

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D21554722

Pulled By: mrshenli

fbshipit-source-id: 53c2abd8de43545873be486e1fb893bc329d65a1
2020-05-14 19:01:28 -07:00
25177e2796 [quant] Support empty batch input for quantized ops (#38508)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38508

Test Plan:
python test/test_quantization.py TestQuantizedOps.test_empty_batch

Imported from OSS

Differential Revision: D21581937

fbshipit-source-id: e50580dec0682848a0703f7bdee6e9351ab79814
2020-05-14 18:42:50 -07:00
bc49d938e2 Revert D21585458: [pytorch][PR] [RELAND] .circleci: Improve docker image build workflow
Test Plan: revert-hammer

Differential Revision:
D21585458

Original commit changeset: 37792a1e0f5e

fbshipit-source-id: cd4c6794708f27a80077e0af27ccf52c5c6ba832
2020-05-14 18:11:03 -07:00
0e0b9496fe [c2] [easy] stop gradient when diagnose (#38518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38518

as title

Test Plan: buck test

Reviewed By: olittle

Differential Revision: D21562570

fbshipit-source-id: 3a2e8dea3d821a2bdb9f30db25816a2bfa6c5dcf
2020-05-14 17:30:39 -07:00
8cdc4807cd [RELAND] .circleci: Improve docker image build workflow (#38484)
Summary:
closes https://github.com/pytorch/pytorch/issues/37855

Relies on https://github.com/pytorch/pytorch/pull/38483

Previous attempts to get this right:
* https://github.com/pytorch/pytorch/pull/38335
* https://github.com/pytorch/pytorch/pull/38279
* https://github.com/pytorch/pytorch/pull/37976

This reverts commit 80639604a82422e314890f154242202a43d264f9.

Improves the docker image build workflow from many manual steps to one that is
basically transparent from a user's perspective.

To update docker images now all one has to do is edit the
.circleci/docker folder and it will update automatically and also
dynamically add the tags to the list of tags to keep from the garbage
collector.

Adding a new image will currently stay the same but we can explore doing
that dynamically as well.

How the build workflow works:
  - Docker tags are determined by the hash defined from git for the
    .circleci/docker sub-directory (extracted using git rev-parse)
  - Images are only built if the computed hash is not found in ecr and
    the hash is different than the previously computed hash. The
    previously computed hash is found using the same process as before
    but subbing out HEAD for the merge base between HEAD and the base
    git revision
  - That tag is then passed through the jobs using a shared workspace
    which is added to downstream jobs using the circleci ${BASH_ENV}
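
A sketch of the tag computation in Python (illustrative; the workflow itself does this in shell):
```python
import subprocess

def tree_hash(rev: str) -> str:
    # hash of the .circleci/docker subtree at the given revision;
    # it only changes when something under that directory changes
    return subprocess.check_output(
        ["git", "rev-parse", f"{rev}:.circleci/docker"], text=True).strip()

base = subprocess.check_output(
    ["git", "merge-base", "HEAD", "origin/master"], text=True).strip()
rebuild_needed = tree_hash("HEAD") != tree_hash(base)
```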

How the new garbage collection works:
  - Tags to keep are generated by stepping through all of the commits in
    in the .circleci/docker subdirectory

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38484

Differential Revision: D21585458

Pulled By: seemethere

fbshipit-source-id: 37792a1e0f5e5531438c4ae61507639c133aa76d
2020-05-14 17:11:04 -07:00
bbfd0ef244 [c2] register cuda op for LpNorm (fallback) (#38517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38517

as title

Test Plan: buck test

Reviewed By: olittle

Differential Revision: D21562485

fbshipit-source-id: 573419e5a8dae4121d99d5b72ed3960a92db7a54
2020-05-14 16:54:12 -07:00
504637a171 [quant][graphmode] Support ops with fixed quantization parameters (#38278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38278

Support ops like aten::hardsigmoid that have fixed quantization parameters:
```
  constexpr float o_scale = 1.0f / 256.0f;
  constexpr int32_t o_zero_point = 0;
```

Ops supported:
- hardsigmoid
- sigmoid
- tanh

Test Plan: Imported from OSS

Differential Revision: D21559811

fbshipit-source-id: 26f3c9c3389dea4f07b350172e2974fac8c5c470
2020-05-14 16:36:06 -07:00
de7025fbdb [quant] Support for functional quantized::conv1d (#38449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38449

Also update docs to reflect conv1d op support

Test Plan:
python test/test_quantization.py TestQuantizedFunctional.test_conv1d_api

Imported from OSS

Differential Revision: D21575921

fbshipit-source-id: 21c9f6b49ad456cd9d93e97f17cf5b8d87f0da6b
2020-05-14 16:09:51 -07:00
8e732514cd [quant][graphmode] Add support for quantized conv1d + relu fusion (#38441)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38441

Test Plan:
python test/test_quantization.py test_quantized_conv1d_relu

Imported from OSS

Differential Revision: D21575919

fbshipit-source-id: d43e33052ce1be5e38acef8fac16f22cb11c0695
2020-05-14 16:09:46 -07:00
f4605ae5c3 [quant] Fusion support for conv1d + ReLU (#38438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38438

Fusion for PTQ flow in eager mode. Graph mode to follow
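
Usage follows the existing eager-mode fusion API; a minimal sketch:
```python
import torch
from torch.quantization import fuse_modules

m = torch.nn.Sequential(
    torch.nn.Conv1d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
)
m.eval()
# eager-mode fusion of conv1d + relu for the PTQ flow
fused = fuse_modules(m, [["0", "1"]])
```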

Test Plan:
python test/test_quantization.py TestFusion

Imported from OSS

Differential Revision: D21575920

fbshipit-source-id: 5bac6602520f42ae3f4957d1a55e6a863daa0257
2020-05-14 16:08:11 -07:00
8b6bf2a457 Add C++ Landing Page (#38450)
Summary:
* Add cpp_index.rst for landing page to match 1.5 (https://github.com/pytorch/pytorch/blob/release/1.5/docs/source/cpp_index.rst)
* Link to new cpp landing page was added to the docs table of contents in this PR: https://github.com/pytorch/pytorch/pull/38350
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38450

Differential Revision: D21580939

Pulled By: jlin27

fbshipit-source-id: 021c43f207a100d554266e4e16cb6752ca9c56a0
2020-05-14 16:02:01 -07:00
1f87f15ba3 Remove _reset_warning_registry (#38485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38485

Python 2 has reached end-of-life and is no longer supported by PyTorch.
This class does nothing in Python 3.

Test Plan: CI

Reviewed By: ailzhang

Differential Revision: D21575260

Pulled By: dreiss

fbshipit-source-id: 184696c9fa501e8d2517950b47cdbc90b2ae8053
2020-05-14 15:03:30 -07:00
b140ed6848 Remove structseq_slice (#35625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35625

Python 2 has reached end-of-life and is no longer supported by PyTorch.
This function was already ifdef'ed out in Python 2.

Added a comment about when we might be able to remove this entire file.

Test Plan: CI

Differential Revision: D20842885

Pulled By: dreiss

fbshipit-source-id: 1fd3b1b2ff5a82caaf3bc11344dde2941427cfc0
2020-05-14 15:03:24 -07:00
6d642a6f6c Remove (most) Python 2 support from C++ code (#35614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35614

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well.

Test Plan: CI

Differential Revision: D20842876

Pulled By: dreiss

fbshipit-source-id: 18abf0d324ed2185ec6d27c864e935d856dcc6ad
2020-05-14 15:01:49 -07:00
1b973aa2a2 Sort CirlceCI config.yml keys to facilitate diff review after codegen (#38496)
Summary:
This will support another round of migration from hand-written configs to code generation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38496

Differential Revision: D21581624

Pulled By: kostmo

fbshipit-source-id: aed814ef6d4fc6af9ce092727b2dacc99de14ae0
2020-05-14 14:33:25 -07:00
69dca43c35 Updating submodules
Summary:
GitHub commits:

64bad39e0d
d10385b2cf
d1d606ea75
5b7309d5fe
e5c84d203b
6e64791678
b9a2b343c4
d9c1059140
1bcde534b5
46981b8186

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 5299cef9c91612f176ff0c29d0cc3acf629d2240
2020-05-14 14:15:22 -07:00
0e80c12bb4 [pytorch] fix -Wlogical-op-parentheses in SortingKthValue.cu (#38500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38500

Reported by Clang:
```
caffe2/aten/src/ATen/native/cuda/SortingKthValue.cu:77:56: error: '&&' within '||' [-Werror,-Wlogical-op-parentheses]
                    || THCNumerics<scalar_t>::isnan(v) && THCNumerics<scalar_t>::isnan(kValue));
                    ~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
caffe2/aten/src/ATen/native/cuda/SortingKthValue.cu:77:56: note: place parentheses around the '&&' expression to silence this warning
                    || THCNumerics<scalar_t>::isnan(v) && THCNumerics<scalar_t>::isnan(kValue));
                                                       ^
                       (                                                                      )
```

Test Plan:
```
buck build mode/opt -c fbcode.cuda_use_clang=true fblearner/flow/projects/dper:workflow
```

Reviewed By: ngimel

Differential Revision: D21578871

fbshipit-source-id: 83595152a370a4acbb2c3b5823dbae9c21485f06
2020-05-14 13:59:31 -07:00
9d0e935b48 skip torchbind on rocm (#38501)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38501

Test Plan: Imported from OSS

Differential Revision: D21579298

Pulled By: suo

fbshipit-source-id: 4ac0b6beac26c97c1e0ff68304996ce62be8e8ce
2020-05-14 12:58:27 -07:00
4d4895a62a Use Future's then() API to fix RPC profiling (#38352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38352

Fixes the RPC profiling by using the `then()` API added in https://github.com/pytorch/pytorch/pull/37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user, as the future still completes with the value of the original future (i.e., the RPC's return value).

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.
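
A rough Python-side sketch of the `then()` chaining described above, using the `torch.futures` API (the profiling callback body is a placeholder, not the actual agent code):

```python
import torch

fut = torch.futures.Future()

def profiling_cb(f):
    # placeholder for the profiling bookkeeping; the chained future
    # completes with whatever this callback returns
    return f.wait()  # pass the original RPC result through unchanged

chained = fut.then(profiling_cb)   # new future, completes after the callback
fut.set_result("rpc-return-value")
print(chained.wait())              # "rpc-return-value" -- transparent to the caller
```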

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 104086114

Test Plan: Re-enabled profiling tests

Differential Revision: D21506940

fbshipit-source-id: 35cde22f0551c825c9bc98ddc24cca412878a63a
2020-05-14 12:52:45 -07:00
f178bf10f1 Support rpc_async call with timeout in JIT (#37884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37884

Adds support for using the rpc_timeout param in rpc_async calls from JIT for
parity with eager mode. Done by:
1) Adding timeout as an input in ir_emitter.cpp if it is specified.
2) Parsing the float IValue from inputs in the `prim::rpc_async` operator, giving the default if needed. A usage sketch follows.
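
A minimal sketch of the newly supported call shape (assumes an initialized RPC framework and a peer named "worker1"; not taken from the actual tests):

```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def remote_add() -> torch.Tensor:
    # the timeout (in seconds) is now parsed as an input to prim::rpc_async
    fut = rpc.rpc_async("worker1", torch.add,
                        args=(torch.ones(2), torch.ones(2)),
                        timeout=5.0)
    return fut.wait()
```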

Added UTs in jit/rpc_test.
ghstack-source-id: 104083031

Test Plan: Added UTs in jit/rpc_test.

Differential Revision: D21268895

fbshipit-source-id: 34bb10a2ac08b67dd6b789121ab43e2c0e696229
2020-05-14 12:44:26 -07:00
3300dd5227 .circleci: Keep tags that look like a sha1 (#38483)
Summary:
Previous attempts to get this right:
* https://github.com/pytorch/pytorch/pull/38335
* https://github.com/pytorch/pytorch/pull/38279
* https://github.com/pytorch/pytorch/pull/37976

This tag kept getting deleted before the docker image CI workflow could
be merged, causing upstream breakages.

It'd be best to make sure the garbage collector just doesn't
garbage-collect it.

This is a pre-step to merge https://github.com/pytorch/pytorch/pull/38484

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38483

Differential Revision: D21577359

Pulled By: seemethere

fbshipit-source-id: c4e0709bd8fff8f24a988b60eaa9f8c01576ef2f
2020-05-14 12:38:33 -07:00
38d141ede5 Support having a different forward method when we are not in scripting mode (#38158)
Summary:
TorchScript currently doesn’t support `*args, **kwargs` in method signatures, which are extensively used in DPER3 low-level modules’ forward methods. In order to make DPER3 low-level modules scriptable, I was thinking about a solution of having a forward method *only* for TorchScript, and replacing the forward method when we are not in scripting mode.

This solution works today, and I would like to add a test to make sure it will always work in the future.
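
A minimal sketch of the pattern (module and method names here are illustrative, not the DPER3 code):

```python
import torch

class MyModule(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # fixed, TorchScript-friendly signature
        return x * 2

    def _flexible_forward(self, *args, **kwargs):
        return args[0] * 2

m = MyModule()
if not torch.jit.is_scripting():
    # outside scripting mode, swap in the *args/**kwargs version
    m.forward = m._flexible_forward
print(m(torch.ones(3)))  # tensor([2., 2., 2.])
```
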
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38158

Differential Revision: D21485657

Pulled By: yf225

fbshipit-source-id: df7368e8a5265418be7c305e6666ffd76e595466
2020-05-14 12:13:06 -07:00
5f2a274015 Fix conv non zero padding being applied in wrong dim (#37881)
Summary:
It turns out `F.pad` takes its padding dims in reverse order (last dimension first; example below). Fixes https://github.com/pytorch/pytorch/issues/37844
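
A quick illustration of the ordering (the pad tuple starts at the last dimension):

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 2, 3)          # NCHW
print(F.pad(x, (1, 1)).shape)        # torch.Size([1, 1, 2, 5]) -- pads W only
print(F.pad(x, (1, 1, 2, 2)).shape)  # torch.Size([1, 1, 6, 5]) -- W by 1, H by 2
```
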
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37881

Differential Revision: D21554011

Pulled By: soumith

fbshipit-source-id: a85a7f6db9f981d915728965903c5c57b6617c93
2020-05-14 11:56:38 -07:00
b57a339703 Guard against negative rpcTimeout being passed in to RpcBackendOptions (#38267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38267

Assert that the rpcTimeout is positive in RpcBackendOptions
constructor
ghstack-source-id: 104029918

Test Plan: CI

Differential Revision: D21509850

fbshipit-source-id: c925490e3d8fa2ffa42b0ae1170ca2f740af11f7
2020-05-14 11:33:23 -07:00
d1eeb3b7bb [Tensorexpr] Fix and improve handling multiple gpu devices (#38365)
Summary:
These commits fix a bug which was exposed when we took away the fallback path. The fix is to set the appropriate device before setting the CUDA stream.
The improvement: when compiling, set the device to the new device only if it differs from the prior device, and remove a redundant call to cudaFree.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38365

Reviewed By: zheng-xq

Differential Revision: D21537469

Pulled By: protonu

fbshipit-source-id: b9662dd623b5c7cfd23eb6894e992a43665641e4
2020-05-14 11:17:17 -07:00
af597335d4 Remove unnecessary to_string in RPC logging code. (#38414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38414

The `std::to_string` call is unnecessary when using glog.
ghstack-source-id: 104030161

Test Plan: Ran the retry tests and checked logs to ensure the correct message was printed upon message failure.

Differential Revision: D21266330

fbshipit-source-id: 53519287778d47d99b94ea34b7c551f910affda2
2020-05-14 10:57:00 -07:00
2f4da7c00c Remove a use of exec (#35624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35624

Python 2 has reached end-of-life and is no longer supported by PyTorch.
This test case is valid syntax in Python 3.

Test Plan: CI

Differential Revision: D20842877

Pulled By: dreiss

fbshipit-source-id: 856e72171496aa1d517f2f27a8a5066462cf4f76
2020-05-14 10:08:04 -07:00
7f7fdb1013 Remove a use of checkScript(str) (#35623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35623

Python 2 has reached end-of-life and is no longer supported by PyTorch.
This test case is valid syntax in Python 3.

Test Plan: CI

Differential Revision: D20842874

Pulled By: dreiss

fbshipit-source-id: 9f12e046f827d4f9d5eca99b0b0b46f73e06ff51
2020-05-14 10:07:58 -07:00
313bea84ef Remove _get_wrapped_func (#35621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35621

Python 2 has reached end-of-life and is no longer supported by PyTorch.
`func.__wrapped__` can be used directly in Python 3.

Test Plan: CI

Differential Revision: D20842875

Pulled By: dreiss

fbshipit-source-id: 26f71df12db6d5118c8f278b27d747d647d07900
2020-05-14 10:07:53 -07:00
d060deb5bb Remove _compatible_subtest (#35620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35620

Python 2 has reached end-of-life and is no longer supported by PyTorch.
`self.subTest` can be used directly in Python 3.

Test Plan: CI

Differential Revision: D20842872

Pulled By: dreiss

fbshipit-source-id: 6ad42550c01e6959821ff07df767fc14b58c5a9e
2020-05-14 10:07:48 -07:00
7026b39ac7 Remove _uses_true_division (#35618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35618

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Python 3 always uses true division.

Test Plan: CI

Differential Revision: D20842884

Pulled By: dreiss

fbshipit-source-id: 522e34bb584d4bdb01c9c40eb267955062a57774
2020-05-14 10:07:42 -07:00
328fc70b84 Remove (most) Python 2 support from setup.py (#35617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35617

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up some cruft that we put in place to support it.

Test Plan: CI

Differential Revision: D20842883

Pulled By: dreiss

fbshipit-source-id: 18dc5219ba99658c0ca7e2f26863df008c420e6a
2020-05-14 10:06:20 -07:00
cbff959bd7 [quant] Return default qconfig when backend is 'none' (#38407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38407

We can still run some quantized tests even when fbgemm/qnnpack isn't enabled

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21554257

fbshipit-source-id: e4fa8f61f6a6717881c00620ed7938c01ffbf958
2020-05-14 09:53:50 -07:00
7f11079769 Delete "named_guard" in native_functions.yaml (#38429)
Summary:
"named_guard" is not a supported option (i.e., a typo).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38429

Differential Revision: D21572794

Pulled By: zou3519

fbshipit-source-id: 6e799611344f373b03f64410d7af9c2c89a75f55
2020-05-14 09:48:23 -07:00
25f918548d Allow GradScaler to be pickled (#38296)
Summary:
Should unblock https://github.com/PyTorchLightning/pytorch-lightning/issues/1782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38296

Differential Revision: D21553296

Pulled By: albanD

fbshipit-source-id: 9041a72d7cf8833e4b01bc767fd2321f17c7c5f2
2020-05-14 09:14:28 -07:00
ae392a77a6 Add better device idx parse checks (#37376)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37376

Differential Revision: D21476036

Pulled By: zou3519

fbshipit-source-id: 86907083c23cbaf165b645307fb340f2656b814e
2020-05-14 09:07:12 -07:00
0a159b0a3a Fix precision issues in CPU remainder (#38293)
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.

This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.

Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.

I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
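
For context, the sign conventions that distinguish the two ops (a quick illustration, not part of the PR):

```python
import torch

a = torch.tensor([-3.0, 3.0])
print(torch.remainder(a, 2.0))  # tensor([1., 1.])  -- sign follows the divisor
print(torch.fmod(a, 2.0))       # tensor([-1., 1.]) -- sign follows the dividend
```
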
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293

Differential Revision: D21539801

Pulled By: ezyang

fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
2020-05-14 08:54:32 -07:00
3e9b4332d2 Fix @skipIfNoFBGEMM for types (#38432)
Summary:
Return the unmodified type from the decorator if fbgemm is present.

Fix `Tried to trace <__torch__.torch.classes.rnn.CellParamsBase object at 0x55f504c56b40> but it is not part of the active trace. Modules that are called during a trace must be registered as submodules of the thing being traced` thrown from `TestPostTrainingDynamic.test_quantized_rnn` by preserving modules in the returned qRNNBase (i.e., by partially reverting https://github.com/pytorch/pytorch/pull/38134 ). A sketch of the decorator fix follows.
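
A minimal sketch of the decorator shape (the availability flag is a stand-in for the real capability check):

```python
import unittest

FBGEMM_AVAILABLE = True  # stand-in for the real capability check

def skipIfNoFBGEMM(obj):
    if FBGEMM_AVAILABLE:
        # return the class/function untouched so tracing and
        # isinstance checks still see the original type
        return obj
    return unittest.skip("requires FBGEMM")(obj)

@skipIfNoFBGEMM
class TestSomething(unittest.TestCase):
    def test_noop(self):
        self.assertTrue(True)
```
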
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38432

Differential Revision: D21567333

Pulled By: malfet

fbshipit-source-id: 364fa2c8fc6e400b4f2e425b922a977756aec1d8
2020-05-14 08:27:29 -07:00
628e3b6fbd Fix unreachable validation for gradcheck (#37915)
Summary:
Hi, I found a validation check that is unreachable in the `gradcheck` function :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37915

Differential Revision: D21551661

Pulled By: albanD

fbshipit-source-id: 8acadcc09cd2afb539061eda0ca5e98860e321eb
2020-05-14 08:18:14 -07:00
48c0331e01 Sparse softmax support (CPU) (#36305)
Summary:
This PR implements softmax support for sparse tensors.

The sparse softmax is related to the dense softmax when the values of unspecified sparse tensor entries are taken to be `-inf`, which has the effect of ignoring those entries. This relation is used for testing the correctness of results here (see the sketch after the checklist).

Resolves https://github.com/pytorch/pytorch/issues/23651 for CPU.

- [x] sparse softmax
  - [x] CPU C++ implementation
  - [x] unittests
  - [x] update softmax documentation
  - [x] autograd support
- [x] sparse log_softmax
  - [x] CPU C++ implementation
  - [x] unittests
  - [x] update log_softmax documentation
  - [x] autograd support
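
A rough sketch of the dense/sparse relation used for testing (assumes the `torch.sparse.softmax` entry point this PR adds):

```python
import torch

i = torch.tensor([[0, 0, 1], [0, 2, 1]])
v = torch.tensor([1.0, 2.0, 3.0])
s = torch.sparse_coo_tensor(i, v, (2, 3))

sparse_out = torch.sparse.softmax(s, dim=1).to_dense()

dense = torch.full((2, 3), float('-inf'))
dense[i[0], i[1]] = v              # fill only the specified entries
dense_out = torch.softmax(dense, dim=1)

print(torch.allclose(sparse_out, dense_out))  # True
```
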
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36305

Differential Revision: D21566540

Pulled By: ezyang

fbshipit-source-id: a632ea69c38622f960721482e442efeb8d0a54fc
2020-05-14 08:08:40 -07:00
fedb70a8fb Fix encoding errors for hipify tool (#37906)
Summary:
Encoding errors occur when using anaconda python 3.6.10 to run hipify_python.py, e.g., "'ascii' codec can't decode byte 0xc3".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37906

Differential Revision: D21549531

Pulled By: ezyang

fbshipit-source-id: 2ffb5787e192a5c03711baa5c7e2577cb5bcab5a
2020-05-14 08:07:04 -07:00
2b2d2168e8 Issue #27441 Fix: Bug in updating ModuleDict & ParameterDict (#27814)
Summary:
Fix a bug in `nn.ModuleDict.update` and `nn.ParameterDict.update` when passing another dictionary of the same type as input (see the sketch below).
Related issue: [Issue https://github.com/pytorch/pytorch/issues/27441](https://github.com/pytorch/pytorch/issues/27441)
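
A quick sketch of the now-working case (illustrative keys and modules):

```python
import torch.nn as nn

d = nn.ModuleDict({'a': nn.Linear(2, 2)})
other = nn.ModuleDict({'a': nn.Linear(2, 2), 'b': nn.ReLU()})
d.update(other)        # the code path this PR fixes
print(list(d.keys()))  # ['a', 'b']
```
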
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27814

Differential Revision: D21518099

Pulled By: ezyang

fbshipit-source-id: 9e6bb6fcc26c8070e137e2e52c65f69a1fcaab37
2020-05-14 08:01:41 -07:00
15da26f8aa DOC: Add documentation for Tensor.is_nonzero (#37845)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37438 by adding documentation for `Tensor.is_nonzero`
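
For reference, the documented behavior in a nutshell (`is_nonzero` only applies to single-element tensors):

```python
import torch

print(torch.tensor([1.5]).is_nonzero())  # True
print(torch.tensor([0.0]).is_nonzero())  # False
# torch.tensor([1, 2]).is_nonzero() raises: more than one element
```
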
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37845

Differential Revision: D21494422

Pulled By: mruberry

fbshipit-source-id: ee4f5979922d7c8100b5031d770ccdf59fe1c1a1
2020-05-14 04:46:55 -07:00
96885f73ed make test_jit infer the profiling mode, add a job for simple executor (#38374)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38374

Differential Revision: D21567658

Pulled By: Krovatkin

fbshipit-source-id: c0eb44cf6c842d5feebabf8c7d99c1b4aa6c4960
2020-05-13 23:55:40 -07:00
b5868b2833 Relax sampler check in BatchSampler (#38403)
Summary:
Since the check was added in https://github.com/pytorch/pytorch/pull/6249, one cannot pass an iterable as a sampler to the data loader anymore, which was a very handy feature (e.g., https://github.com/pytorch/pytorch/issues/1337). I think the check should be removed for two reasons:
1. It is too strict. There is no reason that it should not be a general iterable.
2. It is inconsistent. In `DataLoader` (the main place where people use samplers), you can pass a general iterable as `batch_sampler` but not `sampler` due to this check. A sketch of the now-allowed usage follows.
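
A sketch of the usage this unblocks (a plain list of indices as the sampler):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10))
loader = DataLoader(ds, sampler=[9, 3, 1, 7], batch_size=2)
for (batch,) in loader:
    print(batch)  # tensor([9, 3]) then tensor([1, 7])
```
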
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38403

Differential Revision: D21555958

Pulled By: soumith

fbshipit-source-id: c7267bb99a31edd8f2750689205d6edc5dab5cff
2020-05-13 22:24:29 -07:00
f3d2e332f1 [PyTorch] Remove duplicate jit core sources filelists (#38430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38430

Add `jit_core_[sources|headers]` to `build_variables.bzl`, use them from BUILD.bazel as well as from internal build systems

Test Plan: CI

Reviewed By: suo

Differential Revision: D21555649

fbshipit-source-id: e78572465f36560806d646f147b2ef5a53ba1efe
2020-05-13 22:19:31 -07:00
061ed739c1 Embed ATen/core/CMakeLists.txt into its parent (#38426)
Summary:
This file was separated from the main CMakeLists.txt to enable mobile builds, but at the moment it is only referenced from the CMakeLists.txt in the parent folder.
This is a preparatory step to move `jit_core_sources` and `jit_core_headers` to build_variables.bzl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38426

Test Plan: CI

Differential Revision: D21567389

Pulled By: malfet

fbshipit-source-id: e6340fad1da75aa3e24d6c340df0c3e1e1957595
2020-05-13 22:14:19 -07:00
f99a693cd9 Remove unnecessary py::object copy in PyRRef ctor (#38402)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38402

Test Plan: Imported from OSS

Differential Revision: D21554724

Pulled By: mrshenli

fbshipit-source-id: abab45010810ec53628ea2c7a9c76cdc50eb2f74
2020-05-13 22:00:13 -07:00
54c16b44cf [ROCm] increase timeout, enable test_backend_group (#36166)
Summary:
CC iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36166

Differential Revision: D21566721

Pulled By: ezyang

fbshipit-source-id: 4fc83af918e1b427511388d9227da35a91156dfd
2020-05-13 21:46:37 -07:00
8d94615c2b Migrate erfc from TH to ATen (CUDA) (#38373)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/24559
Reference https://github.com/pytorch/pytorch/issues/24507
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38373

Differential Revision: D21549626

Pulled By: ezyang

fbshipit-source-id: 84c2cf58b071df3afc312ae0aef3b5ed6c014cc7
2020-05-13 21:19:03 -07:00
beedc6542e relax MAX_JOBS restriction for ROCm builds (#38425)
Summary:
CC ezyang xw285cornell sunway513

Forcing MAX_JOBS=4 was done 2 years ago.  We have tested up to MAX_JOBS=256.  OOM issues are no longer observed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38425

Differential Revision: D21566747

Pulled By: ezyang

fbshipit-source-id: f7f50e44a287268f1b06bcea3cb4e11c80260cc3
2020-05-13 21:12:14 -07:00
b1d2c1765e Updating submodules
Summary:
GitHub commits:

c2eda06820
7ed5f9f16c

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 48b466893b7537c845bc30b2880f7bb9d7c1d265
2020-05-13 20:26:37 -07:00
336e1ec592 Clean up error handling in is_nonzero and where in TensorCompare.cpp (#38150)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38150

Differential Revision: D21539736

Pulled By: ezyang

fbshipit-source-id: e390c12f5948192a552d66dcd1bb89b2cb45f170
2020-05-13 20:19:40 -07:00
5a979fcb99 allow user passing relative paths in include_dirs within setuptools.setup (#38264)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38264

Test Plan: Imported from OSS

Differential Revision: D21509277

Pulled By: glaringlee

fbshipit-source-id: b0bc17d375a89b96b1bdacde5987b4f4baa9468e
2020-05-13 20:00:12 -07:00
ee8bf1c640 [quant][graphmode][refactor] insertDeQuantForAllUse (#38277)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38277

Test Plan: Imported from OSS

Differential Revision: D21559809

fbshipit-source-id: 87f73c9ec9d5be5a3224d963fed35792ca0decc1
2020-05-13 19:09:03 -07:00
eb66dd0bc8 [quant][graphmode][refactor] Refactor propagateQuantizationOps (#38276)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38276

Test Plan: Imported from OSS

Differential Revision: D21559814

fbshipit-source-id: 31331415c30f59cde0af478cfad5e890e994ef71
2020-05-13 19:07:38 -07:00
8d883f5c7c [JIT] [Easy] Add location to implicit conversions (#38442)
Summary:
Previously, we weren't adding the location to implicit conversions, so the error message wouldn't show location when these ops failed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38442

Differential Revision: D21563500

Pulled By: eellison

fbshipit-source-id: 19dd786ab8580f11ed919aac669efeed0ef52dcb
2020-05-13 18:02:41 -07:00
7ce733d218 [quant][graphmode] Move leaky_relu to general value op map (#38166)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38166

Test Plan: Imported from OSS

Differential Revision: D21559813

fbshipit-source-id: 8521f7ad2b0fcd6f87090fb40517d5d92c37ba54
2020-05-13 17:51:14 -07:00
16696186e1 [quant][graphmode] Move elu to general value ops map (#38165)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38165

Test Plan: Imported from OSS

Differential Revision: D21559812

fbshipit-source-id: 55bc28d71d0b8a1c33e05bce20a802db1015ea0b
2020-05-13 17:51:09 -07:00
98d78a7f20 [quant][graphmode] Move hardtanh to general value ops map (#38164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38164

Test Plan: Imported from OSS

Differential Revision: D21559808

fbshipit-source-id: 7b00e40cfa58806ce8675a61073778c4d77f8a8b
2020-05-13 17:51:03 -07:00
1fde373f2f [quant][graphmode] Move clamp to general value ops map (#38163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38163

Test Plan: Imported from OSS

Differential Revision: D21559805

fbshipit-source-id: db02bd17fbc6d1335fe021265955d02d52d139e6
2020-05-13 17:50:57 -07:00
e988b4fbb1 [quant][graphmode] Move interpolate to general value ops (#38162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38162

Test Plan: Imported from OSS

Differential Revision: D21559810

fbshipit-source-id: 2d975fc71f73c18f594108172850dfcfdb0cb9a0
2020-05-13 17:49:08 -07:00
0d220ef381 [torchbind] Better error message when missing init. (#37474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37474

Previously we would segfault

Test Plan: Imported from OSS

Differential Revision: D21297542

Pulled By: suo

fbshipit-source-id: c7e2f828a250c490ec23fb51c6a4a642d3370e52
2020-05-13 17:38:31 -07:00
2efa7e04c2 [jit] move torchbind tests to separate file (#37473)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37473

Test Plan: Imported from OSS

Differential Revision: D21297541

Pulled By: suo

fbshipit-source-id: 65c48094b1f26fbbf251021957257ce04279922b
2020-05-13 17:37:00 -07:00
7d7d73655d [quant][graphmode] Add quantizedconv1d to graphmode (#38341)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38341

Test Plan:
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_quantized_conv1d

Imported from OSS

Differential Revision: D21554256

fbshipit-source-id: baf78c7788a38acd9362204990f0b22c21263dfb
2020-05-13 16:59:24 -07:00
ae11718c45 [quant] Add quantized::conv1d op benchmarck (#38332)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38332

Test Plan:
python -m pt.qconv_test --test QConv1d_N1_IC128_OC256_L64_G1_kernel3_stride1_pad0
Forward Execution Time (us) : 147.844

python -m pt.conv_test --test Conv1d_IC128_OC256_kernel3_stride1_N1_L64_cpu
Forward Execution Time (us) : 470.750

Imported from OSS

Differential Revision: D21553662

fbshipit-source-id: 9c240a141f9cd3a82a20aa462e8e5577e002a387
2020-05-13 16:59:19 -07:00
f6626aaf43 [quant] Add support for Quantized Conv1d and ConvRELU1d (#38283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38283

Adds support for the modules and tests

Test Plan:
python test/test_quantization.py TestStaticQuantizedModule.test_conv1d_api

Imported from OSS

Differential Revision: D21553665

fbshipit-source-id: 7ea28da024bdf59f87f300d616c266f2b41f0bcd
2020-05-13 16:59:13 -07:00
2d221df52f [quant] Add support for quantized::conv1d operator (#38248)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38248

Test Plan: Imported from OSS

Differential Revision: D21553661

fbshipit-source-id: 430b4c3244be0cf1a18bdf16788a2023c524c10b
2020-05-13 16:57:43 -07:00
1676c7d618 Added autograd tests, disabled jit autograd tests for complex and added a separate list for tests for complex dtype only (#38399)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38399

Test Plan: Imported from OSS

Differential Revision: D21555941

Pulled By: anjali411

fbshipit-source-id: ea9f5a76590c5bab3df6a540617b074238bfb535
2020-05-13 16:41:09 -07:00
53439be643 improve some reporting for fakelowp tests (#38428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38428

Use and log a randomly generated seed with each test.

Test Plan: locally tested

Reviewed By: amylittleyang

Differential Revision: D21554466

fbshipit-source-id: 008185d13116ec8553b082150a355ba87682bf6a
2020-05-13 15:56:49 -07:00
dac9b61850 Move Cuda Abs kernel to its own file. (#38274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38274

UnarySignKernels is one of the longest files to compile and Abs is not a sign function.

Test Plan: Imported from OSS

Differential Revision: D21511831

Pulled By: gchanan

fbshipit-source-id: f8572ab21321a241c984c64f7df83e2cb5e757d5
2020-05-13 15:44:30 -07:00
ff76de8ace speed up hardswish and hardsigmoid tests (#38256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38256

Removes hypothesis to speed these tests up, as they were flagged among the
slowest tests in CI. At the same time, combines the fbgemm and qnnpack test
cases for better reuse.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_hardswish
python test/test_quantization.py TestQuantizedOps.test_qhardsigmoid
```

Imported from OSS

Differential Revision: D21506831

fbshipit-source-id: 9ff70e4ec7ae30b6948fe808878f0187e631f4d8
2020-05-13 15:37:51 -07:00
afa4dbd731 Use GIL to guard decref of jit::toPyObj return value in processRpc (#38376)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38376

Test Plan: Imported from OSS

Differential Revision: D21540179

Pulled By: mrshenli

fbshipit-source-id: 082fa5f11da7fc1f083710b498e72abc5ba2c244
2020-05-13 15:36:12 -07:00
33977ca769 Update Cpp, rpc docs and Libraries section to match 1.5 (#38350)
Summary:
* Link cpp docs to the cpp landing page
* Link to rpc.rst landing page
* Update Libraries to match 1.5 (https://github.com/pytorch/pytorch/blob/release/1.5/docs/source/index.rst)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38350

Differential Revision: D21554435

Pulled By: jlin27

fbshipit-source-id: d1c9d5a86f84910225cbd0a57074ae95c8a9a450
2020-05-13 15:20:35 -07:00
328dd9e5d6 [future] Make new IValue future constValue semantics match torch::utils counterpart (#38355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38355

The torch::utils::Future API, from which this API was copied last week,
intentionally does not throw. Harmonize the semantics and comment
appropriately.
ghstack-source-id: 104014210

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D21533016

fbshipit-source-id: db26af32656d7b9dacf4fad4e77c944a0087c9b0
2020-05-13 15:02:23 -07:00
b668bbc404 [quant][graphmode][refactor] Factor out common parts of general value ops (#38161)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38161

Test Plan: Imported from OSS

Differential Revision: D21512972

fbshipit-source-id: 61425f7c51fe5972527432b74407486aa479d999
2020-05-13 14:17:45 -07:00
6e13146d96 [TensorExpr] TensorExprKernel: don't do any compilation or lowering in run(). (#37948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37948

The input JIT graph has all the information we need to perform the
entire compilation at construction time. We don't need to postpone
any steps until execution time. Also, from the graph we always know
what device we will be executing on and thus we don't need to have a
CodeGen cache in TensorExprKernel - we always have one and only one
CodeGen.

Test Plan: Imported from OSS

Reviewed By: protonu

Differential Revision: D21432145

Pulled By: ZolotukhinM

fbshipit-source-id: 8dc86b891713056b2c62f30170cd4a168912f027
2020-05-13 14:02:23 -07:00
eac54f18b8 Vectorize SmoothL1Loss forward (CPU) (#37115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37115

Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz):

```python
import timeit
for op in ('SmoothL1Loss',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float', 'torch.bfloat16'):
        for n, t in [(10_000, 100000),
                    (100_000, 10000)]:
            print(f'torch.nn.{op}()(a, b), |a-b|>1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 3, dtype={dtype})', number=t))
            print(f'torch.nn.{op}()(a, b), |a-b|<1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 1.5, dtype={dtype})', number=t))
```

Results:

Before:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8427017140056705
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.823863306999556
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9239509999897564
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9014650480094133
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4530331650021253
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4551637870026752
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5716871829936281
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5748704470024677
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
9.777982015002635
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
12.627838339001755
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
7.810075458997744
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
10.73597132100258
```

After:

```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8420191049808636
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.8814279660000466
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9491433810035232
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9144560259883292
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4458729829930235
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4474395569995977
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5676976410031784
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5793530470109545
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.32380092900712
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.332892568985699
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3354615129937883
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3352111729909666
```

Test Plan: Imported from OSS

Differential Revision: D21351860

Pulled By: VitalyFedyunin

fbshipit-source-id: b19ca1e58586d964972e5c495aba10c8808cd747
2020-05-13 12:50:40 -07:00
b90fc52c68 [quant] Implement unsqueeze/squeeze for per-channel qtensor (#38247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38247

The per-channel quantized tensor's axis value is shifted based on the unsqueeze/squeeze dim (see the sketch below).
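
A small sketch of the axis bookkeeping (the scales and zero points are illustrative):

```python
import torch

x = torch.randn(2, 3)
q = torch.quantize_per_channel(
    x,
    scales=torch.tensor([0.1, 0.2, 0.3]),
    zero_points=torch.zeros(3, dtype=torch.long),
    axis=1, dtype=torch.qint8)

print(q.q_per_channel_axis())               # 1
print(q.unsqueeze(0).q_per_channel_axis())  # 2 -- shifted by the new dim
```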

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_qtensor_unsqueze

Imported from OSS

Differential Revision: D21550293

fbshipit-source-id: 90ea4a1bd637588360b3228cb5af9176176eb033
2020-05-13 12:45:55 -07:00
0526eb0f08 Fix aten_add. aten_sub to handle 2-operand versions (#38367)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38367

Reviewed By: Krovatkin

Differential Revision: D21550736

Pulled By: protonu

fbshipit-source-id: 83491d35cc9168af2208c4f19c423d23e7de836d
2020-05-13 12:26:33 -07:00
d403b85c00 [quant][graphmode] Move aten::mean to general value ops (#38160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38160

Test Plan: Imported from OSS

Differential Revision: D21512971

fbshipit-source-id: 98cb1cc0eec5e7b140dcdf4e756bdbcd724b98f3
2020-05-13 11:39:22 -07:00
2a54533c64 Fix the flooding log issues (#38356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38356

Reduce the log size
ghstack-source-id: 103997991

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D21532296

fbshipit-source-id: d5ab5a8acc18a2b4210131d0d6b932e293c303a9
2020-05-13 11:23:17 -07:00
f64d24c941 speed up SyncBatchNorm by batching distributed communication (#38246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38246

Speeds up SyncBatchNorm by batching the distributed communication.
Initial benchmarks show a ~15+% speed improvement on MobileNetV2 and
EfficientNetB3 on a single machine with 8 gpus. Improvement
vs baseline increases as # of gpus increases.

Test Plan:
verified that before+after intermediate values in fwd/bwd pass are equivalent (with `torch.allclose`)

benchmark runner:
https://gist.github.com/vkuzo/7b1ce1b1b051ee6d46877d0f18ab9b1f

results (1 forward pass + 1 backward pass, 1 machine, 8x Tesla-P100, batch_size=20 per node):
```
model           gpus  before_ms after_ms  speedup
efficientnet-b3 2     660       654       0.00909
efficientnet-b3 4     777       710       0.08623
efficientnet-b3 8     988       838       0.15182
mobilenet-v2    2     267       266       0.00375
mobilenet-v2    4     328       289       0.1189
mobilenet-v2    8     453       373       0.1766
```

Imported from OSS

Differential Revision: D21505905

fbshipit-source-id: 3e796343fce8329a2e17671d60ae66c0387924e7
2020-05-13 11:21:42 -07:00
899a075b25 Split up BinaryAritmeticKernel.cu to speed up compilation time. (#38263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38263

On my machine, compilation went from 4m8sec for the single file to a maximum of 2m22sec across the split files.

Test Plan: Imported from OSS

Differential Revision: D21508985

Pulled By: gchanan

fbshipit-source-id: 2917cd5f30c6b31229053cada93c95e3a27ab29a
2020-05-13 10:51:05 -07:00
d86de916a9 Migrate exp and exp_ from the TH to Aten (CUDA) (#36652)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24561

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.exp(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.exp(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.exp(a) a.numel() == 10000 for 20000 times torch.half
0.3001665159999902
torch.exp(a) a.numel() == 10000 for 20000 times torch.float
0.28265794499998265
torch.exp(a) a.numel() == 10000 for 20000 times torch.double
0.3432170909998149
torch.exp(a) a.numel() == 100000 for 20000 times torch.half
0.32273333800003456
torch.exp(a) a.numel() == 100000 for 20000 times torch.float
0.31498759600003723
torch.exp(a) a.numel() == 100000 for 20000 times torch.double
1.079708754999956
```

After:

```
torch.exp(a) a.numel() == 10000 for 20000 times torch.half
0.27996097300092515
torch.exp(a) a.numel() == 10000 for 20000 times torch.float
0.2774473429999489
torch.exp(a) a.numel() == 10000 for 20000 times torch.double
0.33066844799941464
torch.exp(a) a.numel() == 100000 for 20000 times torch.half
0.27641824200145493
torch.exp(a) a.numel() == 100000 for 20000 times torch.float
0.27805968599932385
torch.exp(a) a.numel() == 100000 for 20000 times torch.double
1.0644143180015817
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36652

Differential Revision: D21164653

Pulled By: VitalyFedyunin

fbshipit-source-id: 42c7b24b0d85ff1d390231f1457968a8869b8db3
2020-05-13 10:06:51 -07:00
e7b4ef8fd3 Revert "Partial revert of #38144 to fix ROCm CI. (#38363)" (#38380)
Summary:
The changes in this file broke ROCm and got reverted in https://github.com/pytorch/pytorch/issues/38363. This PR brings it back with ROCm fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38380

Differential Revision: D21549632

Pulled By: ezyang

fbshipit-source-id: 68498aba70e651352d58fd0c865e71420dbf900a
2020-05-13 09:58:23 -07:00
f2c6346ebe [quant][graphmode] Move avg_pool/adaptive_avg_pool to general value ops (#38330)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38330

Test Plan:
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_quantize_general_value_ops

Imported from OSS

Differential Revision: D21533452

fbshipit-source-id: 56928d93624f7c3d5c61f2627a19c5d3bb595202
2020-05-13 09:22:24 -07:00
138769b1b8 [ROCm] add exact_dtype=False to bfloat16 test (#38381)
Summary:
CC rohithkrn ezyang xw285cornell

Fixes
- TestNNDeviceTypeCUDA.test_activations_bfloat16_cuda
- TestNNDeviceTypeCUDA.test_pooling_bfloat16_cuda
- TestNNDeviceTypeCUDA.test_softmax_bfloat16_cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38381

Differential Revision: D21549636

Pulled By: ezyang

fbshipit-source-id: acb290c57eff4077b040a696267ecde613f0a433
2020-05-13 08:48:18 -07:00
61bea93fca Further parallelize linspace in addition to AVX (#38093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38093

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R) Xeon(R) E-2136, Parallelization using OpenMP):

```
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.linspace(0, 10, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.linspace(0, 10, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

With AVX
========

Before:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.0942596640015836
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.9209065200011537
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.0520610109997506
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.9031864690005023
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.949299545998656
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.82629113800067
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.9547776939980395
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.8259895039991534
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
2.759497356000793
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
2.6285490109985403
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
2.3456633150017296
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
2.2031515989983745
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.559069258000818
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
2.378239962999942
```

After:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
0.8100852870011295
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.18943897200006177
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
0.6679975400002149
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.17846923400065862
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.1431112539976311
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
0.3336703610002587
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.157699686998967
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
0.32964968899977976
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.5379577429994242
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
0.4638638729993545
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.360489848000725
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
0.4033017760011717
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
1.4591587399991113
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
0.44132660000104806
```

Without AVX
===========

Before:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
3.4967273879992717
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
3.330881046000286
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
2.176502857997548
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
2.023505228000431
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
2.117801246000454
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.9885458380013006
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
2.1057261179994384
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.9809251260012388
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
3.187070896001387
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
3.049615387000813
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
3.4874590049985272
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
3.33596555099939
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
4.256659758000751
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
4.100936053000623
```

After:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.9155298300029244
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.598213522000151
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.3183841649988608
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.40136947100108955
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.2191377319977619
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
0.35984685299990815
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.2153874989999167
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
0.35752785600197967
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.750796647000243
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
0.5376063230032742
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.9153429929974664
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
0.5952553579991218
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.281823589000851
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
0.7391443560009066
```

Differential Revision: D21528099

Test Plan: Imported from OSS

Pulled By: malfet

fbshipit-source-id: a6b3904e7860bb6d652a48b2056154509e73157d
2020-05-12 23:48:31 -07:00
9a2d8dfe63 [TensorExpr] Benchmarks: set up profiling executor and fuser according to the given arguments. (#38295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38295

Test Plan: Imported from OSS

Differential Revision: D21525741

Pulled By: ZolotukhinM

fbshipit-source-id: 8bf1d54da062c8e0653bb2cb627883ae4ed14774
2020-05-12 23:27:46 -07:00
3a478b1cbf Updating submodules
Summary:
GitHub commits:

483ccc940c

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: b4fc9fdab591b9ff10a6a228c6c92166518253c4
2020-05-12 23:24:10 -07:00
167a978a03 Fix method stub creation for function attributes (#37994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37994

Before, reassigning a method in a module (like `forward = _forward`)
didn't work, because we look at the function object's name for our def
name when building the AST. Make that overridable to handle cases like
reassignment (a sketch follows).
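
A minimal sketch of the reassignment pattern this enables:

```python
import torch

class M(torch.nn.Module):
    def _forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1

    forward = _forward  # the def name no longer has to match the attribute

print(torch.jit.script(M())(torch.zeros(2)))  # tensor([1., 1.])
```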

Test Plan: Imported from OSS

Differential Revision: D21444535

Pulled By: suo

fbshipit-source-id: 4f045f18b5a146edc8005689af525d7d7ed8dd5f
2020-05-12 23:20:35 -07:00
3d968088e0 fix multinomial kernels to properly advance random states (#38046)
Summary:
Before, multinomial kernels did not advance random states enough, which led to the same sequence being generated over and over with a shift of 4. This PR fixes that.
Fixes https://github.com/pytorch/pytorch/issues/37403
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38046

Differential Revision: D21516542

Pulled By: ngimel

fbshipit-source-id: 23248a8c3a5c44316c4c35cd71a8c3b5f76c90f2
2020-05-12 22:33:11 -07:00
756788ea87 Keep py::object alive until jit::toIValue returns (#38348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38348

Test Plan: Imported from OSS

Differential Revision: D21530282

Pulled By: mrshenli

fbshipit-source-id: a507402fbbd89618936ac6eecb4a223ab86236c6
2020-05-12 22:16:17 -07:00
e39991e838 [TensorPipe Agent] Bind default IP address (#37910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37910

To resolve the issue: https://github.com/pytorch/pytorch/issues/36715

In the TensorPipe RPC agent, we currently hardcode localhost as the pipes' handshake IP address. This prevents us from setting up cross-host connections. As a first step, we start binding the IP address of a given network device; for now it defaults to eth0. We will provide options to let the user configure it.

Test Plan: CI

Reviewed By: lw

Differential Revision: D21421094

fbshipit-source-id: 60f612cbaeddcef7bd285136ad75af20709a7d56
2020-05-12 21:09:11 -07:00
c20b0080c6 Partial revert of #38144 to fix ROCm CI. (#38363)
Summary:
CC ezyang xw285cornell
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38363

Differential Revision: D21539778

Pulled By: ezyang

fbshipit-source-id: 0f7d3b8e3b30ab4d5992f1c13aa8d48069796a8d
2020-05-12 21:03:19 -07:00
797c608f50 Explicitly decref py::object in PythonRpcHandler (#38366)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38366

Test Plan: Imported from OSS

Differential Revision: D21537612

Pulled By: mrshenli

fbshipit-source-id: 089bcc3d7de3bce6e769f72d67e0e0f91e0219c6
2020-05-12 20:55:59 -07:00
2e9d6d99be Explicitly decref py::object in ConcretePyObjectHolder and PythonFunctionGuard (#38364)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38364

Test Plan: Imported from OSS

Differential Revision: D21537611

Pulled By: mrshenli

fbshipit-source-id: e22d1f1360cf71bec526841b5014013b11316f8d
2020-05-12 20:55:53 -07:00
d001862aff Minor code cleanup (#38340)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38340

Test Plan: Imported from OSS

Differential Revision: D21530281

Pulled By: mrshenli

fbshipit-source-id: 358bdcb6b2eb3ed871fd8b699438b0ef05362613
2020-05-12 20:54:11 -07:00
6be3e5d3bb [caffe2] weight_decay in reduced precision adagrad
Summary: As title

Test Plan: CI

Reviewed By: taiqing

Differential Revision: D21512729

fbshipit-source-id: 0777c90954ebad0cbd5785460e7b2a7c8c146316
2020-05-12 20:33:40 -07:00
cfe3c795ed Port torch/csrc/jit/runtime/register_distributed_ops.cpp to new operator registration API (#38014)
Summary:
Port register_distributed_ops.cpp with the new registration API introduced in https://github.com/pytorch/pytorch/issues/36258.

resolve https://github.com/pytorch/pytorch/issues/37579

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38014

Differential Revision: D21502643

Pulled By: ezyang

fbshipit-source-id: e1749d788b5c0f2a903ffac2f0c94929d6a8ad72
2020-05-12 19:14:18 -07:00
34523b70c1 Renamed *_transformation to transformation::* (#38301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38301

Test Plan: Imported from OSS

Differential Revision: D21534886

Pulled By: pbelevich

fbshipit-source-id: 44baa563b6e8624bcc6290c4054e9b3189bad69b
2020-05-12 19:11:37 -07:00
4f08bdddfc Add skipIfNoSciPy/get_all_int_dtypes/get_all_fp_dtypes to common_utils (#38299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38299

Test Plan: Imported from OSS

Differential Revision: D21534876

Pulled By: pbelevich

fbshipit-source-id: 864881b3be899aea3660039128d9bc2e94edab95
2020-05-12 19:11:31 -07:00
00be4abc38 Fixing DistributionsHelper.h includes (#38298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38298

Moved unnecessary includes to `THTensorRandom.cpp`

Test Plan: Imported from OSS

Differential Revision: D21534864

Pulled By: pbelevich

fbshipit-source-id: bfec9cf5ce7587b1bd1674bc47850c16446621e9
2020-05-12 19:11:26 -07:00
70c6550cc9 Forgotten changes for Tensor.random_()'s from and to bounds for floating-point types (#38287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38287

Test Plan: Imported from OSS

Differential Revision: D21534847

Pulled By: pbelevich

fbshipit-source-id: 6ea972186789347555efbbf68407b5f12960dae6
2020-05-12 19:09:37 -07:00
eb3e9872c9 [JIT] make torch.unique compilable (#38156)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/37986

Follows the stack in https://github.com/pytorch/pytorch/pull/33783 to make functions in `torch/functional.py` resolve to their python implementations. Because the return type of `torch.unique` depends on `return_inverse` and `return_counts`, I had to refactor the implementation to use our boolean_dispatch mechanism (a sketch of the scriptable result follows).
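
A sketch of the scriptable result (the return type is pinned by the literal flag):

```python
import torch

@torch.jit.script
def uniq_counts(x: torch.Tensor):
    return torch.unique(x, return_counts=True)

print(uniq_counts(torch.tensor([1, 1, 2])))  # (tensor([1, 2]), tensor([2, 1]))
```
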
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38156

Differential Revision: D21504449

Pulled By: eellison

fbshipit-source-id: 7efb1dff3b5c00655da10168403ac4817286ff59
2020-05-12 18:37:53 -07:00
4a266c93a6 Allow specifying range in and cpu_serial_kernel and cpu_serial_kernel_vec (#37981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37981

This additional parameter may be helpful in parallelizing range factories

Differential Revision: D21506744

Test Plan: Imported from OSS

Pulled By: malfet

fbshipit-source-id: be9418216510ae600c555188971663fafb413fa0
2020-05-12 18:32:33 -07:00
f7e7a15a5d Fix NaN comparison in torch.median (#38216)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38018

When calling `eq_with_nan(v, kValue)` with `v` and `kValue` both `nan`, it returns `false` when it should return `true`.
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/SortingKthValue.cu#L76

The implementation is using intrinsics such as `__double_as_longlong` and comparing their bit representations. But the values of the bits obtained for both nans are different.
`9221120237041090560` for `v`
`9223372036854775807` for `kValue`

Two different NaNs can have different bit representations, so we have to do additional comparisons to fix this (illustrated below).
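
A quick Python illustration of the two bit patterns quoted above (assumes an IEEE-754 double, as on typical platforms):

```python
import struct

def bits(x: float) -> int:
    return struct.unpack('<q', struct.pack('<d', x))[0]

v = float('nan')                                            # quiet NaN
k = struct.unpack('<d', struct.pack('<q', 0x7fffffffffffffff))[0]

print(bits(v))            # 9221120237041090560
print(bits(k))            # 9223372036854775807 -- different bits...
print(v != v and k != k)  # True -- ...yet both values are NaN
```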

I changed this comparison and it seems to be working now.
However, when compared to a CPU implementation, the returned indices for the values seem to be random but valid.
Probably this is an effect of the comparison order in the CUDA version.
I am not sure if this is OK, but all the indices do point to valid elements.

For the snippet in the issue I get the following:

```
# CUDA Values
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       device='cuda:0', dtype=torch.float64)
# CUDA indices
tensor([304, 400, 400, 528, 304, 304, 528, 336, 304, 432, 400, 280, 280, 336,
        304, 336, 400, 304, 336, 560], device='cuda:0')
```
```
# CPU values
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       dtype=torch.float64)
# CPU indices
tensor([515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515,
        515, 515, 515, 515, 515, 515])
```

Also, maybe it's better to change the `eq_with_nan` implementations to address this instead?
I am not sure if this will cause code to break in other places, though...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38216

Differential Revision: D21517617

Pulled By: ngimel

fbshipit-source-id: deeb7bb0ac519a03aa0c5f365005a9150e6404e6
2020-05-12 18:27:14 -07:00
2c881417a7 Change input scale to double type for conv params. (#38346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38346

Given that qtensor stores scale as a double, this mismatch can cause us to
repack weights every time in QNNPACK. Worse, given that we release the
original weights, the runtime can crash.

Test Plan:
pytest test/quantization/test_quantized_module.py::TestStaticQuantizedModule::test_conv2d_api

Imported from OSS

Differential Revision: D21529384

fbshipit-source-id: 859b763dee5476e1554ebc278c5b95199a298eab
2020-05-12 18:02:22 -07:00
e3357a7812 Fix typo in build environment name (#38343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38343

Test Plan: `grep pytorch-linux-xenial-py3.6-gcc5.4-test .circleci -R` + CI

Differential Revision: D21537546

Pulled By: malfet

fbshipit-source-id: 4e790eaee388e51e28640b43d56ef2c07ca146c4
2020-05-12 17:54:54 -07:00
80639604a8 Revert D21536269: [pytorch][PR] [RELAND] [RELAND] .circleci: Improve docker image build workflow
Test Plan: revert-hammer

Differential Revision:
D21536269

Original commit changeset: 5577f84fa49d

fbshipit-source-id: dd824f74521595b7a0efac7ae94ce3c64df04a20
2020-05-12 17:28:34 -07:00
c2ac2127be [JIT] recursively compile class types (#38050)
Summary:
Make it so that non-nn.Module classes do not need to be annotated with `torch.jit.script` (a sketch follows).
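
A minimal sketch of what now compiles (an undecorated helper class used from a scripted function):

```python
import torch

class Pair(object):  # no @torch.jit.script annotation needed
    def __init__(self, a: int, b: int):
        self.a = a
        self.b = b

    def total(self) -> int:
        return self.a + self.b

@torch.jit.script
def use_pair(x: int) -> int:
    return Pair(x, x + 1).total()  # Pair is compiled recursively

print(use_pair(3))  # 7
```
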
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38050

Differential Revision: D21482654

Pulled By: eellison

fbshipit-source-id: 22689e4d7a33f6e1574b9495cff29a1fe6abb910
2020-05-12 17:16:28 -07:00
cdf4d42c39 [RELAND] [RELAND] .circleci: Improve docker image build workflow (#38335)
Summary:
This reverts commit 6e66e8562f276e2015af8ff76437a3f0277c4bcc.

Two things learned from the previous reland:
* `circleci-agent step halt` doesn't actually halt the step in place; you must explicitly exit the step after `step halt` is called
* Even though `circleci` uses `git` to check out repositories inside of docker images, that does not mean `git` is available after the fact.

<details>
<summary> Changes from previous reland </summary>

```patch
commit cc99a12c9029472bd73325876bc0e9dbb1746b05
Author: Eli Uriegas <eliuriegas@fb.com>
Date:   Tue May 12 10:58:18 2020 -0700

    .cirlceci: Install git for gc, exit step explicitly

    Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

 diff --git a/.circleci/config.yml b/.circleci/config.yml
index 481d7889da..856a0fb10a 100644
 --- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -2018,13 +2018,15 @@ jobs:
               export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
               eval $(aws ecr get-login --no-include-email --region us-east-1)
               set -x
+              PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
               # Check if image already exists, if it does then skip building it
               if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
                 circleci-agent step halt
+                # circleci-agent step halt doesn't actually halt the step so we need to
+                # explicitly exit the step here ourselves before it causes too much trouble
+                exit 0
               fi
-              PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
               # If no image exists but the hash is the same as the previous hash then we should error out here
-              # no stampeding herd effect plz.
               if [[ ${PREVIOUS_DOCKER_TAG} = ${DOCKER_TAG} ]]; then
                 echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
                 echo "       contact the PyTorch team to restore the original images"
 diff --git a/.circleci/ecr_gc_docker/Dockerfile b/.circleci/ecr_gc_docker/Dockerfile
index d0198acb86..36347d5e6d 100644
 --- a/.circleci/ecr_gc_docker/Dockerfile
+++ b/.circleci/ecr_gc_docker/Dockerfile
@@ -1,6 +1,6 @@
 FROM ubuntu:16.04

-RUN apt-get update && apt-get install -y python-pip && rm -rf /var/lib/apt/lists/* /var/log/dpkg.log
+RUN apt-get update && apt-get install -y git python-pip && rm -rf /var/lib/apt/lists/* /var/log/dpkg.log

 ADD requirements.txt /requirements.txt

 diff --git a/.circleci/verbatim-sources/docker_jobs.yml b/.circleci/verbatim-sources/docker_jobs.yml
index e04d11c5cd..3918cc04ae 100644
 --- a/.circleci/verbatim-sources/docker_jobs.yml
+++ b/.circleci/verbatim-sources/docker_jobs.yml
@@ -35,13 +35,15 @@
               export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
               eval $(aws ecr get-login --no-include-email --region us-east-1)
               set -x
+              PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
               # Check if image already exists, if it does then skip building it
               if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
                 circleci-agent step halt
+                # circleci-agent step halt doesn't actually halt the step so we need to
+                # explicitly exit the step here ourselves before it causes too much trouble
+                exit 0
               fi
-              PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
               # If no image exists but the hash is the same as the previous hash then we should error out here
-              # no stampeding herd effect plz.
               if [[ ${PREVIOUS_DOCKER_TAG} = ${DOCKER_TAG} ]]; then
                 echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
                 echo "       contact the PyTorch team to restore the original images"

```

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38335

Differential Revision: D21536269

Pulled By: seemethere

fbshipit-source-id: 5577f84fa49dd6e1e88fce461646fd68be3d417d
2020-05-12 17:10:12 -07:00
3134978816 [JIT] Handle del statements with variables as targets (#37608)
Summary:
**Summary**
This commit modifies the JIT frontend to handle `del` statements with
variables as targets by dropping the mapping corresponding to that
variable from the environment stack maintained by the IR emitter code.

**Test Plan**
This commit adds test cases for deleting a variable, deleting a variable
and then using it, and deleting a variable in an if-statement and then
using it.
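
A minimal sketch of the now-supported form (the function below is hypothetical, not from the commit's test suite):

```python
import torch

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    y = x + 1
    del y  # drops `y` from the IR emitter's environment stack
    # Referring to `y` after this point is a compile-time error.
    return x
```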
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37608

Differential Revision: D21507239

Pulled By: SplitInfinity

fbshipit-source-id: ac7e353817dc76990ece294c95965cf585d6bdfb
2020-05-12 15:17:07 -07:00
a2a53447e4 [Tensorpipe Agent] Add Call Counts to Metrics (#38266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38266

Add the client/server active and async call counters to the
Tensorpipe Agent metrics.
ghstack-source-id: 103949985

Test Plan: CI

Reviewed By: lw

Differential Revision: D21509236

fbshipit-source-id: 66277f44d974c929a65e87bd270222d0ae27395e
2020-05-12 15:09:24 -07:00
a4466eeff4 [Tensorpipe Agent] Tracking Active Call Metrics (#38265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38265

Tracking the active calls counts in the TensorPipe Agent:
* clientActiveCalls: running count of sent RPCs that have not yet been responded to or errored
* serverActiveCalls: running count of received RPCs that have not yet been responded to
* serverAsyncCallCount: running count of received RPCs set to be completed asynchronously
ghstack-source-id: 103949984

Test Plan: CI

Reviewed By: lw

Differential Revision: D21508957

fbshipit-source-id: 8be9dbf77ec06c138c8dd70443976d7bccee0f1e
2020-05-12 15:08:12 -07:00
3317fdf177 Updating submodules
Summary:
GitHub commits:

5a74dde371
07cfe45cd9
71650d5d67
2b81847227
8ed576f96b
8a4ba8a17f
f30ca63d3b
20021f0396
67355d562e
2f0b01b165

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 04350da1e88206c9029aee4c1da9393d75cbbac8
2020-05-12 14:58:12 -07:00
f954dd7823 Add dropout removal pass. (#38253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38253

This pass removes dropout and dropout_ nodes when training is false. It
requires the freeze_module pass to have been run first, which does both
inlining and constant propagation; without it, the training variable remains
an attribute instead of a constant.
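
Roughly, the intended flow looks like the sketch below (a sketch only: it assumes `torch._C._freeze_module` as the internal freezing entry point, and the module is hypothetical):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dropout = torch.nn.Dropout(0.5)

    def forward(self, x):
        return self.dropout(x)

m = torch.jit.script(M().eval())         # training == False
frozen = torch._C._freeze_module(m._c)   # inlines and constant-propagates `training`
# With `training` now a constant False, this pass can drop the dropout node.
```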
ghstack-source-id: 103939141

Test Plan: python test/test_jit.py TestScript.test_remove_dropout

Reviewed By: dreiss

Differential Revision: D21505863

fbshipit-source-id: 42ea45804e4653b625b6a254c8d8480757264aa8
2020-05-12 14:38:34 -07:00
8ab6377273 Port atan from TH to ATen (#37991)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/24538
Related https://github.com/pytorch/pytorch/issues/24507
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37991

Differential Revision: D21531741

Pulled By: VitalyFedyunin

fbshipit-source-id: c762cc80416d7fffbb1769c6cc5e0914ceaa8e2d
2020-05-12 14:22:26 -07:00
d5a7d790a1 Use torch.ne instead of torch.nonzero in gradcheck (#37857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37857

Test Plan: Imported from OSS

Differential Revision: D21528484

Pulled By: anjali411

fbshipit-source-id: 2c43b4e4d484a943210dd9426c2e3ac1c30c8084
2020-05-12 13:45:45 -07:00
7c13a07286 [Reland] Remove uses of type() part 2 (#38288)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/38140. It got reverted since it broke slow tests that are only run on the master branch (thanks mruberry!). Enabling all CI tests in this PR to make sure they pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38288

Reviewed By: mruberry

Differential Revision: D21524923

Pulled By: ailzhang

fbshipit-source-id: 3a9ecc7461781066499c677249112434b08d2783
2020-05-12 13:37:14 -07:00
b6d494d6da [future] Minor: std::move() callback in future for the convenience operator case. (#37861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37861

ghstack-source-id: 103511066

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D21409650

fbshipit-source-id: 6c501963f73590512e426f6806f4530aad618b1a
2020-05-12 13:35:41 -07:00
525295e696 BC upgrader for dynamic Linear with torchbind (#38333)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38333

Test Plan:
Imported from OSS

```
manifold get fblearner_inference_platform_models/tree/149745959/1/149745959_1.000 /tmp/149745959_1
```
In python:
```
import torch
torch.jit.load('/tmp/149745959_1')
```

` buck test mode/dev-nosan //caffe2/torch/fb/predictor/model_repo/tests:model_loading_test`  passes

Reviewed By: ailzhang

Differential Revision: D21527444

Pulled By: jamesr66a

fbshipit-source-id: b33cab29df2d68beb482a044e604e41683e4fef6
2020-05-12 13:30:40 -07:00
906c50eb69 Remove dead code in ddp.{h, cpp} (#37990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37990

The code in `ddp.{h, cpp}` and the corresponding pybind implementations are no longer used. The pybind-exposed calls were all private APIs and ran only in unit tests, so we should remove these unused APIs.

https://github.com/pytorch/pytorch/pull/20234 from a year ago also mentioned that we should delete `_dist_broadcast_coalesced`

Verified that all tests pass with cuda by running `test_c10d` on a gpu-enabled machine.
ghstack-source-id: 103885383

Test Plan: CI

Differential Revision: D21443879

fbshipit-source-id: 764d8681ca629056bfe2c260ffab47fa5bdf07ff
2020-05-12 12:41:09 -07:00
6daaeb2bda [pytorch] Add C++ error when PyTorch used with Python 2
Summary: Python 2 has reached end-of-life and is no longer supported by PyTorch. To avoid confusing behavior when trying to use PyTorch with Python 2, detect this case early and fail with a clear message in C++.

Test Plan: waitforsandcastle

Reviewed By: orionr

Differential Revision: D21043062

fbshipit-source-id: ab448d2888f5048a0180598b882adfc67e31d851
2020-05-12 12:33:47 -07:00
a90e574401 Enable linear/conv + relu fusion in mobile optimizer. (#38139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38139

As title says.

Test Plan: mobile optimizer test.

Reviewed By: AshkanAliabadi

Differential Revision: D21479880

fbshipit-source-id: 07b47cd620ee8af4dbe3c98bd94924b159c1406f
2020-05-12 12:28:26 -07:00
82abd50f2b Added more autograd tests for C->C complex functions (#37856)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37856

Test Plan: Imported from OSS

Differential Revision: D21528455

Pulled By: anjali411

fbshipit-source-id: d18b546cac3aae11c1cda748df56dbf5aeca66b8
2020-05-12 12:19:10 -07:00
291869d625 Remove unnecessary RPC profiling code after future merge (#38255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38255

Now that the futures are consolidated after
https://github.com/pytorch/pytorch/pull/35154, there is no
`torch.distributed.rpc.Future` and we do not need a special path. All futures
can now be profiled through the use of the jit operator defined in
record_function_ops.cpp

As a result, we also get rid of the record_function_ops.h file.
RPC profiling tests are currently disabled, although I re-enabled them locally
to ensure that they still work with this change.
ghstack-source-id: 103869855

Test Plan: CI

Differential Revision: D21506091

fbshipit-source-id: ad68341c9f2eab2dadc72fe6a6c59b05693434f2
2020-05-12 12:03:16 -07:00
7c66ad8941 [caffe2/fakelowp] fix bug in ref code (#38331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38331

C_ref size was wrong

Test Plan: CI

Reviewed By: hyuen

Differential Revision: D21525639

fbshipit-source-id: 59f4709238cdd46bb38f7c534335eb79229f6c7f
2020-05-12 11:58:41 -07:00
779abf7538 Implements torch.pow for complex on cuda and enables complex values as exponents for pow (#36793)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36744

It also allows calling pow on the CPU with complex values as the exponent, which was not possible before.

TODO: Add tests
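
A small sketch of what this enables (the expected value below is computed by hand, so treat it as approximate):

```python
import torch

base = torch.tensor([2.0 + 0j])
base.pow(1 + 1j)  # complex exponent, now allowed on CPU
# ≈ tensor([1.5385+1.2779j]), since 2**(1+1j) = 2 * exp(1j * ln 2)

# On CUDA (if a device is available), complex pow is now implemented too:
# base.cuda().pow(2)
```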
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36793

Differential Revision: D21525514

Pulled By: anjali411

fbshipit-source-id: c4624c97b194cb1d942e5dd0ee9042adf7586ed3
2020-05-12 11:28:44 -07:00
986d7e47c4 Migrate CPU fill kernel to c10::complex (#38026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38026

Test Plan: Imported from OSS

Differential Revision: D21518318

Pulled By: anjali411

fbshipit-source-id: 0bbf47f53a7aad619d5a3e22f7ba875dc007b881
2020-05-12 11:15:34 -07:00
d5e8d90a2c Migrate CPU reduction to c10::complex (#38022)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38022

Test Plan: Imported from OSS

Differential Revision: D21518270

Pulled By: anjali411

fbshipit-source-id: 382845a4d966fcdcb416895341502a09e378e57c
2020-05-12 11:10:10 -07:00
96d2ddba6c remove hardcoded values for fc testing
Summary: removing hard-coded dimensions

Test Plan: ran the test itself

Reviewed By: jspark1105, amylittleyang

Differential Revision: D21520255

fbshipit-source-id: a75043103c61b91b8f10f405abff4790292e92c4
2020-05-12 11:05:11 -07:00
b29ec43555 Limit max numel for test tensors (#38304)
Summary:
Add a `max_numel` option to `hypothesis_utils.array_shapes`.
Use it to limit the tensor element count to 100K for tensors whose maximum number of elements could exceed 250K.
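
One plausible shape of such an option, sketched for illustration (this is not the actual helper from the test suite; it assumes Python 3.8+ for `math.prod`):

```python
import math
from hypothesis import strategies as st

def array_shapes(min_dims=1, max_dims=4, min_side=1, max_side=64, max_numel=None):
    # Draw a tuple of side lengths, optionally rejecting shapes whose
    # total element count exceeds max_numel.
    shapes = st.lists(st.integers(min_side, max_side),
                      min_size=min_dims, max_size=max_dims).map(tuple)
    if max_numel is not None:
        shapes = shapes.filter(lambda s: math.prod(s) <= max_numel)
    return shapes
```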
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38304

Differential Revision: D21525483

Pulled By: malfet

fbshipit-source-id: fac132dc7274b9417141b708cc9535561a95fcb3
2020-05-12 10:46:00 -07:00
9576b37caf Fix test_channel_shuffle hypothesis params (#38327)
Summary:
Otherwise, the zero-point can be out of range if the selected type is torch.qint8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38327

Differential Revision: D21525214

Pulled By: malfet

fbshipit-source-id: 989f58f79830ec7f616a68f0ab00661b15030062
2020-05-12 09:37:09 -07:00
7eb9f1788c Using LoadLibraryEX [Reland] (#38302)
Summary:
This reverts commit 1ab4f35499aa933677152aca6a1ba2cbe86639f8.

Without this PR, the OS tries to find the DLL in the following directories.
- The directory from which the application loaded.
- The system directory. Use the GetSystemDirectory function to get the path of this directory.
- The 16-bit system directory. There is no function that obtains the path of this directory, but it is searched.
- The Windows directory. Use the GetWindowsDirectory function to get the path of this directory.
- The current directory.
- The directories that are listed in the PATH environment variable. Note that this does not include the per-application path specified by the App Paths registry key. The App Paths key is not used when computing the DLL search path.

If we use  LoadLibraryEx with LOAD_LIBRARY_SEARCH_* flags, the directories are searched in the following order.

- The directory that contains the DLL (LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR). This directory is searched only for dependencies of the DLL to be loaded.
- The application directory (LOAD_LIBRARY_SEARCH_APPLICATION_DIR).
- Paths explicitly added to the application search path with the AddDllDirectory function (LOAD_LIBRARY_SEARCH_USER_DIRS) or the SetDllDirectory function. If more than one path has been added, the order in which the paths are searched is unspecified.
- The System32 directory (LOAD_LIBRARY_SEARCH_SYSTEM32).

Advantages:
1. The directory that contains the DLL comes first and it's desirable for us, because the dependencies in `lib` should always be preferred.
2. The system directory is considered last. According to some of the bug reports, DLL load failures are caused by loading conflicting ones from systemroot.

Neural:
1. The directories in `PATH` are not considered. Similar things happen as described in the previous point, so it may be beneficial for normal users. However, it may cause failures if there are new dependencies when building from source. (Resolved by falling back to `LoadLibraryW` if the error code is `126`)

Disadvantages:
1. LoadLibraryEx with LOAD_LIBRARY_SEARCH_* flags is only available for Win7/2008 R2 + KB2533623 and up. (Resolved by falling back to `LoadLibraryW` if it is not supported)
2. A failure during the call to `LoadLibraryEx` leads the OS to pop up a modal dialog, which can block the process if the user is on a CLI-only interface. This can be switched off by calling `SetErrorMode`. (Resolved by calling `SetErrorMode`)
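
A minimal ctypes sketch of the strategy described above (Windows-only; the constants come from the Win32 headers, and the fallback conditions follow this commit message rather than the literal PyTorch implementation):

```python
import ctypes

LOAD_LIBRARY_SEARCH_DEFAULT_DIRS = 0x00001000
SEM_FAILCRITICALERRORS = 0x0001
ERROR_MOD_NOT_FOUND = 126

def load_dll(path):
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.SetErrorMode(SEM_FAILCRITICALERRORS)  # suppress modal error dialogs
    handle = kernel32.LoadLibraryExW(path, None, LOAD_LIBRARY_SEARCH_DEFAULT_DIRS)
    if not handle and ctypes.get_last_error() == ERROR_MOD_NOT_FOUND:
        handle = kernel32.LoadLibraryW(path)  # fall back to the legacy search order
    return handle
```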
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38302

Test Plan:
Test some common cases (in a new repo maybe) including
1. Python 3.6/3.7/3.8, conda python, conda install
2. Python 3.6/3.7/3.8, conda python, pip install
3. Python 3.6/3.7/3.8, official python, pip install
Plus some corner cases like
1. Conflicting DLLs in systemroot or `PATH`
2. Remove some local dependencies and use global ones

References:
1. https://docs.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-seterrormode
2. https://docs.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibraryexa
3. https://docs.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-search-order#standard-search-order-for-desktop-applications

Differential Revision: D21524090

Pulled By: malfet

fbshipit-source-id: 0cf5e260c91759b0af8c7aa0950a488e3b653ef5
2020-05-12 09:31:43 -07:00
6bb1c4a7ab Move (most) generated return statements for TH functions out of the switch. (#38073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38073

Most of the generated return statements don't depend on the scalar type and it saves ~900 lines of generated code.

Test Plan: Imported from OSS

Differential Revision: D21476010

Pulled By: gchanan

fbshipit-source-id: 3fcc4db466d697c90abafb9da6c3f3644621810b
2020-05-12 09:19:09 -07:00
e3584f8d7e Migrate CPU tensor factories to c10::complex (#38021)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38021

Test Plan: Imported from OSS

Differential Revision: D21518263

Pulled By: anjali411

fbshipit-source-id: 9d7769357cf51d3f71d8833fa9ca108a1a97e9cd
2020-05-12 07:32:40 -07:00
4c99a9b672 Add documentation for hardswish (#37989)
Summary:
Fix issue https://github.com/pytorch/pytorch/issues/37431.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37989

Differential Revision: D21502182

Pulled By: zou3519

fbshipit-source-id: 245586fb555f7f1d9ec8d87269035b6fe626b47b
2020-05-12 06:48:51 -07:00
ba0851326c Revert D21449462: [CUDA] addmv for complex tensors
Test Plan: revert-hammer

Differential Revision:
D21449462

Original commit changeset: 1f2dd5a7f8a4

fbshipit-source-id: 4f5f035668d1de4469d11ddeb08a77340eb52f98
2020-05-12 05:21:11 -07:00
5c44f2a16b Updating submodules
Summary:
GitHub commits:

7bb3cd718a

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 6b2332b2db9a27c063c4327671a7e921f85621e5
2020-05-12 01:46:07 -07:00
cf82011361 Codegen CircleCI Windows configs (#38292)
Summary:
This is a step toward re-automating most of the CircleCI `config.yml` generation so that it can be safely refactored into multiple `workflow`s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38292

Differential Revision: D21519337

Pulled By: kostmo

fbshipit-source-id: 09cc4f97ac52f37ef6d8a6fb8f49eeead052b446
2020-05-11 22:16:43 -07:00
3a63728149 [caffe2/fakelowp] optimize ref int8 gemm (#38294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38294

Optimize the reference int8 gemm using avx2 intrinsics

Test Plan:
Before this diff
7.72164 GF/s

After this diff
27.7731 GF/s

Reviewed By: amylittleyang

Differential Revision: D21516439

fbshipit-source-id: 2b596605eec6a338a295701a01cf2c8639204274
2020-05-11 22:04:12 -07:00
dad552666e Add then(callback)->Future API to ivalue::Future (#37311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37311
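
In the released Python API this surfaces roughly as `torch.futures.Future.then`, which wraps `ivalue::Future` (a minimal sketch):

```python
import torch

f = torch.futures.Future()
g = f.then(lambda fut: fut.wait() + 1)  # chain a callback; returns a new Future

f.set_result(torch.tensor(1.0))
print(g.wait())  # tensor(2.)
```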

Test Plan: Imported from OSS

Differential Revision: D21247827

Pulled By: mrshenli

fbshipit-source-id: f8fe0617ccb957aa747a78554a000ce2c4a58495
2020-05-11 21:58:56 -07:00
dcf1861f88 add document for bucktization (#38119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38119

This is for (#37435).
Demo is here:
https://glaringlee.github.io/generated/torch.searchsorted.html
https://glaringlee.github.io/generated/torch.bucketize.html
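
For quick reference, a small usage sketch of the two documented ops (expected outputs written out by hand):

```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
values = torch.tensor([3, 6, 9])

torch.searchsorted(boundaries, values)  # tensor([1, 3, 4]) -- left insertion points
torch.bucketize(values, boundaries)     # tensor([1, 3, 4]) -- bucket index per value
```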

Test Plan: Imported from OSS

Differential Revision: D21517392

Pulled By: glaringlee

fbshipit-source-id: b35795c7f07e9ae4c4806c528eb51fd4ca14d499
2020-05-11 21:54:19 -07:00
0d977e9223 [CUDA] addmv for complex tensors (#37940)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37940

Test Plan: Imported from OSS

Differential Revision: D21449462

Pulled By: anjali411

fbshipit-source-id: 1f2dd5a7f8a42d3ba92a1b1a286f35454392a06d
2020-05-11 21:46:52 -07:00
63c3b89c1c Simplify code with decltype(auto) (#30922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30922

A new C++14 feature we can use now.
ghstack-source-id: 103767403

Test Plan: waitforsandcastle

Differential Revision: D18869644

fbshipit-source-id: 54541c8004b2116386668a31eb9b0410a603b7dc
2020-05-11 21:31:18 -07:00
6943253421 [quant][mobile] Don't release bias tensor (#38284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38284

Bias is used to calculate the number of output channels.

Test Plan: Imported from OSS

Differential Revision: D21515997

fbshipit-source-id: 5fe5ddd4c7ce5cc49d15c477b744994a3db5fc89
2020-05-11 21:23:48 -07:00
09e4ff95ee [quant][mobile] Ensure qconv doesn't assert with empty batch (#38252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38252

Return empty batch output if input has empty batch on mobile.

Test Plan:
python test/test_quantization.py TestQNNPackOps.test_qconv_empty_batch

Imported from OSS

Differential Revision: D21515998

fbshipit-source-id: 1eab4710f4c21d06521e1a172f9bc708dbaeb3c0
2020-05-11 21:22:06 -07:00
ec7beda822 Use thrust::host_vector instead of std::vector (#38178)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38024.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38178

Differential Revision: D21502379

Pulled By: ezyang

fbshipit-source-id: 74dd6504c56f4150ed4cef129fd3f32f378c0564
2020-05-11 20:34:04 -07:00
cebf5a8767 Run mypy on some test files, add iinfo/finfo annotations (#38220)
Summary:
Most test files have a ton of errors; there's not much point adding ignores for them though. The way of working is simply to run `mypy test/test_somefile.py`, fix up the errors, then add that file to the `files =` list in `mypy.ini`.

Can't add all of `test/*` by default, because the JIT test files have (on purpose) syntax errors that are meant to exercise the robustness of the JIT to bad annotations. Leave those alone for now.

_Depends on the ghstacked PRs in gh-38173, only the last 2 commits are new._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38220

Differential Revision: D21503481

Pulled By: ezyang

fbshipit-source-id: 63026e73201c549d64647a03a20a4c6687720244
2020-05-11 20:18:41 -07:00
6e66e8562f Revert D21517822: [pytorch][PR] [RELAND] .circleci: Improve docker image build workflow
Test Plan: revert-hammer

Differential Revision:
D21517822

Original commit changeset: 5f705f6c617c

fbshipit-source-id: a7ee422abdb1e966c267f62c45c73f4b4cb45b57
2020-05-11 20:14:30 -07:00
bf499cccb6 Refactor native/cpu/zmath.h (#38037)
Summary:
There are now `zmath.h` and `zmath_std.h`: the latter is a copy-paste of the original `zmath.h` supporting `std::complex`, while the new `zmath.h` supports `c10::complex`. `zmath_std.h` will be removed eventually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38037

Differential Revision: D21518177

Pulled By: anjali411

fbshipit-source-id: 18552e955dc31f95870f34962d709de0444804f6
2020-05-11 20:09:44 -07:00
375ddb01b5 Fix tensor printing (#38031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38031

Test Plan: Imported from OSS

Differential Revision: D21502915

Pulled By: anjali411

fbshipit-source-id: 0cc3017a390da55af47ba81f651a883cd52b10da
2020-05-11 19:59:19 -07:00
eea9c6a048 [RELAND] .circleci: Improve docker image build workflow (#38279)
Summary:
This reverts commit 7c2853be9dccc7a1ae80a2a421e63e254cd7797c.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38279

Differential Revision: D21517822

Pulled By: seemethere

fbshipit-source-id: 5f705f6c617cbc77a10ab9deb913bc5958ae7439
2020-05-11 19:38:51 -07:00
43dd8760d7 Move ThreadLocalDebugInfo to c10 (#37774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37774

Move ThreadLocalDebugInfo from ATen to C10

Test Plan: Imported from OSS

Differential Revision: D21384249

Pulled By: ilia-cher

fbshipit-source-id: f9b5089a868f84a2ee013695a481fcc883d3c6b2
2020-05-11 19:27:41 -07:00
6968c8153e Warn against callOp (#37797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37797

This is slow (see comment in code).
Not fixing this yet, but at least adding a warning so people are aware and don't add new call sites.
ghstack-source-id: 103887226

Test Plan: waitforsandcastle

Differential Revision: D21390364

fbshipit-source-id: 7bff1c3b9756a16c9d9110f209c23bf557266dda
2020-05-11 19:21:50 -07:00
42a222cf2c DOC: Add missing args for index_add (#38213)
Summary:
Fix https://github.com/pytorch/pytorch/issues/37752 by updating the `index_add` documentation as suggested by danpovey
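
For reference, the call the updated docs describe (a minimal sketch):

```python
import torch

x = torch.zeros(3, 2)
index = torch.tensor([0, 2])
t = torch.ones(2, 2)

x.index_add_(0, index, t)  # adds t's rows into rows 0 and 2 of x, in place
# x is now [[1., 1.], [0., 0.], [1., 1.]]
```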
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38213

Reviewed By: ilia-cher

Differential Revision: D21506728

Pulled By: ngimel

fbshipit-source-id: 3c08bc3743cd4ba8c0c97b7d359d35e82f0127ac
2020-05-11 18:37:25 -07:00
cdd1b9a891 [TensorExpr] Distinguish aten::max reduction op from aten::max elementwise op and only fuse the latter. (#38171)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38171

Test Plan: Imported from OSS

Differential Revision: D21487389

Pulled By: ZolotukhinM

fbshipit-source-id: ac28789bf2bea389f560de4d5b979e036295e96a
2020-05-11 17:45:59 -07:00
21ce4333b9 Remove THFile, THDiskFile, and THMemoryFile (#37830)
Summary:
Fix https://github.com/pytorch/pytorch/issues/36996

Remove seemingly unused files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37830

Differential Revision: D21431969

Pulled By: mruberry

fbshipit-source-id: 824fa86f03ce0049ceddb093468800154ae8048b
2020-05-11 17:18:35 -07:00
1ab4f35499 Revert D21496081: [pytorch][PR] Using LoadLibraryEx and LOAD_LIBRARY_SEARCH_* flag for loading DLLs o…
Test Plan: revert-hammer

Differential Revision:
D21496081

Original commit changeset: aa5e528e5134

fbshipit-source-id: c0636b06dd65c7419018062f79aabc397fb2c5b8
2020-05-11 16:38:37 -07:00
f41833957d bypass getDeviceFromPtr check when device is known (#36714)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36594

In some cases, when using memory that was allocated in another process before doing any memory-related operation in PyTorch, there are errors because the GPU CUDA context is not completely initialized.

I guess there is an explicit reason to leave the context uninitialized at first and not initialize it in `THCudaInit`, where other CUDA calls are going on.
I'd like to discuss it in this PR.

One possibly better solution is to initialize the device context in `fromDLPack` or `from_blob`, probably by creating a dummy array with one element. But this feels like a hack.

Another possibility is to catch the exception in `getDeviceFromPtr`, check whether the context was initialized, and if not, repeat the operation; but we would need to check every device.

This PR bypasses the `getDeviceFromPtr` call, which is the one causing the problem, when we already know the device. This allows us to create the Tensor from the shared-memory storage without initializing the context; the context will be initialized when the tensor is accessed later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36714

Differential Revision: D21504557

Pulled By: ngimel

fbshipit-source-id: 173ccdeb7c2a2b0ece53dd50be97f2df577a5634
2020-05-11 16:20:23 -07:00
8e07b75cef Have DeviceType available in torch namespace (#38036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38036

Resolves: https://github.com/pytorch/pytorch/issues/36946

Test Plan: Imported from OSS

Differential Revision: D21463610

Pulled By: anjali411

fbshipit-source-id: c4aabfac2cd1f05f8b66745aae0a17c2af4d9c9b
2020-05-11 16:06:52 -07:00
7c2853be9d Revert D21511048: [pytorch][PR] .circleci: Improve docker image build workflow
Test Plan: revert-hammer

Differential Revision:
D21511048

Original commit changeset: e4b153a6078e

fbshipit-source-id: 09ad9ad9b108479cba44070c82182dd91fd4f099
2020-05-11 15:52:03 -07:00
333e29c45f [ONNX] Fix pow op export (#38065)
Summary:
Fix pow type cast for opset 9 and update opset 12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38065

Differential Revision: D21485353

Pulled By: malfet

fbshipit-source-id: 3993e835ffad07b2e6585eb5cf1cb7c8474de2ec
2020-05-11 15:46:44 -07:00
19d6e32e9a fix sample code (#38002)
Summary:
Make the Linear layer's sample code work correctly when bias is False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38002

Differential Revision: D21509679

Pulled By: malfet

fbshipit-source-id: c7077992cf414ecc557b39e5ed1e39ef01c8b347
2020-05-11 15:34:09 -07:00
def9f15b57 .circleci: Improve docker image build workflow (#37976)
Summary:
closes https://github.com/pytorch/pytorch/issues/37855

## .circleci: Improve docker image build workflow

Improves the docker image build workflow from many manual steps to one that
is basically transparent from a user's perspective.

To update docker images, all one now has to do is edit the
.circleci/docker folder; images will update automatically, and their tags
are dynamically added to the list of tags the garbage collector keeps.

Adding a new image will currently stay the same but we can explore doing
that dynamically as well.

### How the build workflow works:
  - Docker tags are determined by the hash defined from git for the
    .circleci/docker sub-directory (extracted using git rev-parse)
  - Images are only built if the computed hash is not found in ecr and
    the hash is different than the previously computed hash. The
    previously computed hash is found using the same process as before
    but subbing out HEAD for the merge base between HEAD and the base
    git revision
  - That tag is then passed through the jobs using a shared workspace
    which is added to downstream jobs using the circleci ${BASH_ENV}

### How the new garbage collection works:
  - Tags to keep are generated by stepping through all of the commits in
    the .circleci/docker subdirectory

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37976

Differential Revision: D21511048

Pulled By: seemethere

fbshipit-source-id: e4b153a6078e3875f6cfa03a903b2e951d803cce
2020-05-11 15:25:14 -07:00
a37b865107 test_linspace : remove explicit for-loop (#38191)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/38187

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CPU : Intel® Core i5-8300H CPU @ 2.30GHz × 8
GPU : GTX 1050ti

Test Cmd : `pytest test/test_torch.py -k linspace_cpu_float`

Before :
```
test/test_torch.py ..                                                                                                                                                         [100%]

======================================================================== 2 passed, 5170 deselected in 24.43s ========================================================================
```

After :
```
test/test_torch.py ..                                                                                                                                                         [100%]

======================================================================== 2 passed, 5170 deselected in 9.20s =========================================================================
```

Test Cmd : `pytest test/test_torch.py -k linspace_cuda_float`

Before :
```
test/test_torch.py ......                                                                                                                                                     [100%]

=================================================================== 6 passed, 5166 deselected in 83.84s (0:01:23) ===================================================================
```

After :
```
test/test_torch.py ......                                                                                                                                                     [100%]

======================================================================== 6 passed, 5166 deselected in 40.18s ========================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38191

Differential Revision: D21494478

Pulled By: mruberry

fbshipit-source-id: fa58f727781425937a7b8212f9b63a739935eb86
2020-05-11 15:17:47 -07:00
c6b2844076 Pin flake8 to 3.7.9 (#38269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38269

Test Plan: Imported from OSS

Differential Revision: D21510318

Pulled By: mrshenli

fbshipit-source-id: ac57a0ffed7401c13b7983b8685a8706b8181142
2020-05-11 15:08:36 -07:00
a553935e3c [JIT] Expose magic methods on script::Object (#38167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38167

Test Plan: Imported from OSS

Differential Revision: D21486709

Pulled By: jamesr66a

fbshipit-source-id: 17b44d979fc658768b0d64f7d8af6fb684043ea3
2020-05-11 15:01:15 -07:00
1456515f15 [JIT] Disallow plain List type annotation without arg (#38130)
Summary:
**Summary**
This commit detects and prohibits the case in which `typing.List` is
used as an annotation without a type argument (i.e., bare `typing.List`
rather than `typing.List[T]`).
At present, `typing.List` is always assumed to have one argument, and
when it is used without one, `typing.List.__args__[0]` is nonempty and
set to some `typing.TypeVar` instance, which has no JIT type equivalent.
Consequently, trying to convert `typing.List` to a JIT type results in
a `c10::ListType` with `nullptr` for its element type, which can cause
a segmentation fault.

This is fixed by returning a `ListType` from
`jit.annotations.try_ann_to_type` only if the element type is converted
successfully to a JIT type and returning `None` otherwise.

**Test Plan**
I ran the code from the issue (https://github.com/pytorch/pytorch/issues/37530) that reported this problem and also ran some unit tests.

*Before*
```
$ python3 segfault.py
Segmentation fault (core dumped)
```

*After*
```
$ python3 segfault.py
Traceback (most recent call last):
...
RuntimeError:
Unknown type name 'List':
  File "segfault.py", line 9
    classmethod
    def cat(cls, box_lists: List):
                            ~~~~ <--- HERE
        return cls(torch.cat([x for x in box_lists]))
'Boxes.cat' is being compiled since it was called from 'Boxes'
  File "segfault.py", line 13
def f(t: torch.Tensor):
    b = Boxes(t)
        ~~~~~ <--- HERE
    c = Boxes(torch.tensor([3, 4]))
    return Boxes.cat([b, c])
'Boxes' is being compiled since it was called from 'f'
  File "segfault.py", line 13
def f(t: torch.Tensor):
    b = Boxes(t)
    ~~~~~~~~~~~ <--- HERE
    c = Boxes(torch.tensor([3, 4]))
    return Boxes.cat([b, c])
```

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/37530.
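
For contrast, the accepted form supplies the element type (a minimal sketch, not the reporter's code):

```python
from typing import List

import torch

@torch.jit.script
def cat_tensors(box_lists: List[torch.Tensor]) -> torch.Tensor:
    return torch.cat(box_lists)  # List[Tensor] maps cleanly to a JIT ListType
```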
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38130

Differential Revision: D21485284

Pulled By: SplitInfinity

fbshipit-source-id: 9b51ef6340485a24c8b7cfb85832d4668b8ac51a
2020-05-11 14:15:54 -07:00
00f3790a9d Using LoadLibraryEx and LOAD_LIBRARY_SEARCH_* flag for loading DLLs o… (#37763)
Summary:
…n Windows

Without this PR, the OS tries to find the DLL in the following directories.
- The directory from which the application loaded.
- The system directory. Use the GetSystemDirectory function to get the path of this directory.
- The 16-bit system directory. There is no function that obtains the path of this directory, but it is searched.
- The Windows directory. Use the GetWindowsDirectory function to get the path of this directory.
- The current directory.
- The directories that are listed in the PATH environment variable. Note that this does not include the per-application path specified by the App Paths registry key. The App Paths key is not used when computing the DLL search path.

If we use  LoadLibraryEx with LOAD_LIBRARY_SEARCH_* flags, the directories are searched in the following order.

- The directory that contains the DLL (LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR). This directory is searched only for dependencies of the DLL to be loaded.
- The application directory (LOAD_LIBRARY_SEARCH_APPLICATION_DIR).
- Paths explicitly added to the application search path with the AddDllDirectory function (LOAD_LIBRARY_SEARCH_USER_DIRS) or the SetDllDirectory function. If more than one path has been added, the order in which the paths are searched is unspecified.
- The System32 directory (LOAD_LIBRARY_SEARCH_SYSTEM32).

Advantages:
1. The directory that contains the DLL comes first and it's desirable for us, because the dependencies in `lib` should always be preferred.
2. The system directory is considered last. According to some of the bug reports, DLL load failures are caused by loading conflicting ones from systemroot.

Neural:
1. The directories in `PATH` are not considered. Similar things happen as described in the previous point, so it may be beneficial for normal users. However, it may cause failures if there are new dependencies when building from source. (Resolved by falling back to `LoadLibraryW` if the error code is `126`)

Disadvantages:
1. LoadLibraryEx with LOAD_LIBRARY_SEARCH_* flags is only available for Win7/2008 R2 + KB2533623 and up. (Resolved by falling back to `LoadLibraryW` if it is not supported)
2. A failure during the call to `LoadLibraryEx` leads the OS to pop up a modal dialog, which can block the process if the user is on a CLI-only interface. This can be switched off by calling `SetErrorMode`. (Resolved by calling `SetErrorMode`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37763

Test Plan:
Test some common cases (in a new repo maybe) including
1. Python 3.6/3.7/3.8, conda python, conda install
2. Python 3.6/3.7/3.8, conda python, pip install
3. Python 3.6/3.7/3.8, official python, pip install
Plus some corner cases like
1. Conflicting DLLs in systemroot or `PATH`
2. Remove some local dependencies and use global ones

References:
1. https://docs.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-seterrormode
2. https://docs.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibraryexa
3. https://docs.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-search-order#standard-search-order-for-desktop-applications

What do you think, malfet, ezyang?

Differential Revision: D21496081

Pulled By: malfet

fbshipit-source-id: aa5e528e5134326b00ac98982f4db4b4bbb47a44
2020-05-11 14:02:03 -07:00
5f9b9036c1 Add instance methods tensor.isnan(), tensor.isinf(), tensor.isfinite() (#37942)
Summary:
Fix https://github.com/pytorch/pytorch/issues/37736
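
The new instance methods mirror the existing `torch.isnan`/`torch.isinf`/`torch.isfinite` functions:

```python
import torch

t = torch.tensor([1.0, float('inf'), float('nan')])
t.isnan()     # tensor([False, False,  True])
t.isinf()     # tensor([False,  True, False])
t.isfinite()  # tensor([ True, False, False])
```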
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37942

Differential Revision: D21503150

Pulled By: soumith

fbshipit-source-id: cf6bf57ca67013efe119543f3d9a698473960dec
2020-05-11 13:56:59 -07:00
5137827ad0 Lazily initialise thread local num_threads value (#37461)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37259, fixes https://github.com/pytorch/pytorch/issues/20156

This lazily calls `at::init_num_threads` once for each thread by adding a call to `lazy_init_num_threads` in `at::parallel_for` and `at::parallel_reduce`.

If this solution is okay, then we should add the same to guard other places that might use MKL or OpenMP.
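
The pattern, sketched in Python for illustration only (the actual change is in ATen's C++ `parallel_for`/`parallel_reduce`; `init_num_threads` below is a stand-in for `at::init_num_threads`):

```python
import threading

_tls = threading.local()

def init_num_threads():
    pass  # stand-in for at::init_num_threads (per-thread OpenMP/MKL setup)

def lazy_init_num_threads():
    # Run the per-thread initialization at most once per thread, on first use.
    if not getattr(_tls, "initialized", False):
        init_num_threads()
        _tls.initialized = True
```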
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37461

Reviewed By: ezyang

Differential Revision: D21472763

Pulled By: ilia-cher

fbshipit-source-id: 889d6664f5bd4080037ade02ee324b1233992915
2020-05-11 13:24:45 -07:00
08c3339e7c [pyfi] override TP2 networkx -> PyFI networkx (#37764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37764

Auto-generated diff for TP2->PyFI migration.

```
networkx
  TP2 version: 2.0
  PyFI active wheels (networkx):
    py2-darwin           -> 2.3
    py2-platform007      -> 2.2
    py3-darwin           -> 2.3
    py3-platform007      -> 2.3
    py3.7-platform007    -> 2.3
```

#buildmore

excited_python

Test Plan: buildallthethings

Reviewed By: thatch

Differential Revision: D19790867

fbshipit-source-id: d6f893beee794df5408a5117978b534cafc6ec83
2020-05-11 13:20:00 -07:00
c31913671c DOC: add BFloat16 dtype and BFloat16Tensor (#37051)
Summary:
Related to gh-36318

Mention the `bfloat16` dtype and `BFloat16Tensor` in the documentation. The real fix would be to implement CPU operations on the 16-bit float `half`; and I couldn't help but notice that `torch.finfo(torch.bfloat16).xxx` crashes for `xxx in ['max', 'min', 'eps']`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37051

Differential Revision: D21476851

Pulled By: ngimel

fbshipit-source-id: fef601d3116d130d67cd3a5654077f31b699409b
2020-05-11 12:44:46 -07:00
b290da0e75 Migrate CPU tril, triu, masked_fill to c10::complex (#37897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37897

Test Plan: Imported from OSS

Differential Revision: D21442181

Pulled By: anjali411

fbshipit-source-id: 609af9086da1b622db51694f65eadfebe3970cfd
2020-05-11 12:27:46 -07:00
77d8a44802 If we're building on C++17, use actual "if constexpr" (#38154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38154

This should give better error messages and shorter stack traces on C++17 builds (e.g. fbcode)
ghstack-source-id: 103775564

Test Plan: waitforsandcastle

Differential Revision: D21483327

fbshipit-source-id: 184d1f9c0543bf43dc9713fa97fcc5955e7be319
2020-05-11 12:22:19 -07:00
3569c59600 Invert logic of persistent set and prevent use in jit (#38131)
Summary:
jit.ScriptModule deletes all the actual attributes but still uses the nn.Module implementation.
Since I don't know how to add this new set() to the ScriptModule, it is simpler to just raise a nice error for now.
I also inverted the logic so that an empty set() (which is always the case in a ScriptModule) means that everything is persistent.

cc zdevito should we open an issue to add this to the ScriptModule?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38131

Differential Revision: D21502183

Pulled By: albanD

fbshipit-source-id: 96f83098d9a2a9156e8af5bf5bd3526dd0fefc98
2020-05-11 09:59:24 -07:00
f314d9a077 Remove codegen for IntArrayRefStride, which isn't used. (#38072)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38072

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D21476008

Pulled By: gchanan

fbshipit-source-id: 14e5cc7d7e3412e4aca897adcd3653b86dcfaff4
2020-05-11 08:46:29 -07:00
fe53b52537 Macro generate ScalarTypeToCPPType, including all ScalarTypes. (#38071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38071

Fixes: https://github.com/pytorch/pytorch/issues/34826

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D21476009

Pulled By: gchanan

fbshipit-source-id: 96fa9d81f9581179c674e6af2dd903930c8def68
2020-05-11 08:46:24 -07:00
c26dde967c Kill resize-ing and zero-ing from codegen. (#37958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37958

All codegen invocations have been removed at this point, so this has no effect.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D21433215

Pulled By: gchanan

fbshipit-source-id: 1f58f3022fab6443e34f0201ae4b32b2a99725cf
2020-05-11 08:45:02 -07:00
ebad4e463f add missing include file for fake_nnpi_ops_utils (#38215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38215

add missing include file

Test Plan: tested fix on OSS

Reviewed By: yinghai

Differential Revision: D21498580

fbshipit-source-id: cf6c021738b4a93563fdb98e29502dba5898989d
2020-05-11 08:36:57 -07:00
6edf340338 Delete torch/__init__.pyi, deferring to direct extension stubs (#38157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38157

This removes the error-prone process of assembling `torch/__init__.pyi`
(and frequently forgetting to expose things), since now we can simply
rely on the true source file to get things done.  Most of the old
codegen in gen_pyi.py is now rerouted to various files:

- `torch/_C/__init__.pyi` (the dumping pile of all misc bindings)
- `torch/_C/_nn.pyi` (NN function bindings)
- `torch/_C/_VariableFunctions.pyi` (torch function bindings)

`torch.types` grew a bunch more definitions that previously where
defined in `torch/__init__.pyi`

Some miscellaneous changes

- Fixed a bug where we treated a single TensorList argument as implying
  varargs are accepted. This is actually only supported for IntList.
  This means we can correctly generate a stub for dequantize.
- Add missing manual stub for nonzero
- Switched torch/onnx/operators.py to directly refer to _C module,
  since apparently mypy doesn't think that methods prefixed with
  underscores get reexported.  This may be a recurring theme; maybe
  we need to find a better way to solve it.

Because I was really lazy, I dumped namedtuple definitions in both
`torch._C` and `torch._C._VariableFunctions`.  This is definitely wrong.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21497400

Pulled By: ezyang

fbshipit-source-id: 07b126141c82efaca37be27c07255cb2b9b3f064
2020-05-11 07:20:13 -07:00
6f396e18c3 Add per-device allocator object in CUDACachingAllocator (#37567)
Summary:
Reduces lock contention and BlockPool management costs by tracking applicable state in per-device structures.

`THCCachingAllocator` now maintains a set of `DeviceCachingAllocator` objects (one per device) each of which maintains its own allocator state and operations.

Only global state remains in the top-level THCCachingAllocator object -- namely, `allocated_blocks`, the mapping between the raw storage pointers and the allocator's underlying Block structure.  Global operations deal mostly with this translation and then pass the bulk of the work on to the device-specific allocator.

Conversely, device-specific state and operations are comprised mostly of managing the device's underlying blocks.

This has the following benefits:

- Performance: Access to the global pointer map is serialized independently of the per-device state -- reducing lock contention between operations on different devices.

- Simplicity: Managing the block pools in separate device-specific objects is conceptually more intuitive, simplifies the code and makes certain operations more efficient -- even in the absence of contention (e.g. free_cached_blocks, synchronize_and_free_events, emptyCache, get_all_blocks, etc.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37567

Differential Revision: D21458556

Pulled By: colesbury

fbshipit-source-id: ef56cb373797b180df72f0998ebc35972c892288
2020-05-11 06:44:44 -07:00
324dc1623e add dtype checking for gather and scatter (#38025)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/37996

In `cpu_scatter_gather_base_kernel`, the index pointer is interpreted as `int64_t` regardless of the actual dtype.
2b41b9bceb/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp (L106)
Adding an index dtype check avoids the nasty index-out-of-bounds error. As using `int64_t` is the convention in ATen code (i.e., a limitation), no further fix is needed at the moment.
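
A small sketch of the check's effect (the values here are hypothetical):

```python
import torch

src = torch.arange(6.0).reshape(2, 3)
idx = torch.tensor([[0, 1, 0]])  # index must be torch.int64 (long)

torch.gather(src, 0, idx)          # ok: tensor([[0., 4., 2.]])
# torch.gather(src, 0, idx.int())  # now raises a dtype error instead of
#                                  # silently reading out-of-bounds memory
```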
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38025

Differential Revision: D21498146

Pulled By: ezyang

fbshipit-source-id: b1f96f394a460c4bc63d21ec8d4a2cfbf3e97b03
2020-05-10 23:15:45 -07:00
503be4e05e fixing build failures with USE_NATIVE_ARCH ON (#35359)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35359

Differential Revision: D21497939

Pulled By: ezyang

fbshipit-source-id: d81b653714ee4d7a09945ea9677976a7b61f4d43
2020-05-10 19:27:59 -07:00
f3e620ee83 explain redundant branch/tag filters (#38169)
Summary:
Add a comment because at first glance there doesn't seem to be any need to specify branch and tag filters, just to make them glob to everything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38169

Differential Revision: D21496261

Pulled By: kostmo

fbshipit-source-id: 7f75bb466ceffd6b17d4c97d711a8eb6e8b3143a
2020-05-10 10:34:42 -07:00
5077518c91 [Resubmit] Migrate AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3 to c10::complex (#38144)
Summary:
This reverts commit 0c936f94d647a2c422d29cafaa923047dd243473.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38144

Differential Revision: D21495374

Pulled By: anjali411

fbshipit-source-id: 33249659fba88f087539233c3d297c0280e17208
2020-05-10 08:09:43 -07:00
26928b164f remove internal file logging.h (#38182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38182

remove unused internal logging.h

Test Plan: sandcastle

Reviewed By: yinghai

Differential Revision: D21490039

fbshipit-source-id: bd60d332eb2a5cc67408b07f7777716a086cc7ff
2020-05-09 21:10:54 -07:00
33f4fca1a6 [TensorExpr] remove Let and LetStmt in favour of binding in Block (#37606)
Summary:
Implementation of the less popular proposal for eliminating overlap between LetStmt and Let: removing both and storing a mapping between Var and value Expr in the Block.

This complicates some tests but simplifies the IR by restricting where variable binding can occur.

I used the unit tests & python integration tests to verify this is correct but I'm unsure of coverage, particularly around the dependency checker in loopnest - ZolotukhinM your review would be useful there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37606

Differential Revision: D21467483

Pulled By: nickgg

fbshipit-source-id: b402d3fce4cacf35d75f300f0a7dca32a43b6688
2020-05-09 16:23:37 -07:00
48ad9f5a30 assertEqual now requires matching dtypes (#38103)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38103

Test Plan: Imported from OSS

Differential Revision: D21477062

Pulled By: VitalyFedyunin

fbshipit-source-id: 9592fed336214dd97eb8e9d6b3e16f21ff6f072d
2020-05-09 14:49:01 -07:00
57d01be92b Replacing assertEqual with assertEqualIgnoreType wherever types mismatch (#38102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38102

Test Plan: Imported from OSS

Differential Revision: D21477060

Pulled By: VitalyFedyunin

fbshipit-source-id: 25e0fd837ca9bfccf0ce994c80f7790c894096d4
2020-05-09 14:48:55 -07:00
e3414c1ef1 AssertEqual now checks tensors dtype (#34154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34154

Temporarily replacing `assertEqual` with `assertEqualIgnoreType` in all cases where it fails.

Test Plan: Imported from OSS

Differential Revision: D20251131

Pulled By: VitalyFedyunin

fbshipit-source-id: fa69c6e2b3a7963912af5b0fa42bec9eded323d3
2020-05-09 14:47:01 -07:00
64d083bb86 fix a bracket (#38039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38039

Differential Revision: D21465272

Pulled By: Krovatkin

fbshipit-source-id: 509e967128b15ba171cb72744d21384b871fc8ce
2020-05-09 14:27:55 -07:00
b579433bf7 Revert D21487840: Bind VariableFunctions as a module, not a class with static methods.
Test Plan: revert-hammer

Differential Revision:
D21487840

Original commit changeset: 368da9b9c50e

fbshipit-source-id: 900f5d36490ac8d419c6704f8727d4c8e492bfb7
2020-05-09 11:58:02 -07:00
f6b1c046b6 Revert D21483808: [pytorch][PR] Remove uses of type() part 2
Test Plan: revert-hammer

Differential Revision:
D21483808

Original commit changeset: 12f5de6151ba

fbshipit-source-id: 2755fa97ae3f342ae88b1531acfa790772a27c17
2020-05-09 00:42:39 -07:00
4501083306 dedupe test skipping in common_distributed and test_distributed (#38078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38078

`common_distributed` and `test_distributed` have some error codes that overlap but are used for different reasons; for example, code 75 in `test_distributed` means "no cuda available", but in `common_distributed` it means "need at least 2 CUDA devices".

This is an issue because the tests in `test_distributed` now use the utils in `common_distributed`, so we could get the wrong reason for skipping tests.

It is also the source of test failures in https://github.com/pytorch/pytorch/pull/37990.

This diff makes it so that the test skipping logic is deduped and put into `common_distributed.py`, where it can be reused and then imported into `test_distributed`
ghstack-source-id: 103782583

Test Plan: CI

Differential Revision: D21466768

fbshipit-source-id: 53b5af36672ebd8b51ba8b42709d87e96cadef20
2020-05-08 23:19:26 -07:00
e109ff6379 Use py::pickle in RRef pickling pybind code (#38147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38147

We're seeing many warnings of the form:
```
/home/rvarm1/pytorch/torch/distributed/rpc/__init__.py:14: FutureWarning:
pybind11-bound class 'torch.distributed.rpc.RRef' is using an old-style
placement-new '__setstate__' which has been deprecated. See the upgrade
guide in pybind11's docs. This message is only visible when compiled in
debug mode.
```
in test logs, it turns out this is because pybind recommends using `py::pickle`
instead of manually defining getstate and setstate (see https://github.com/pybind/pybind11/blob/master/docs/upgrade.rst#id5). Changing to use pybind's
recommendation will silence these warnings.

Note that return types need to be added to the function to satisfy the contract
pybind expects, but they don't return anything since we TORCH_CHECK(false) in
all cases.
ghstack-source-id: 103769585

Test Plan: CI

Differential Revision: D21446260

fbshipit-source-id: a477e4937b1d6134992c57467cdbe10f54567b8b
2020-05-08 22:36:59 -07:00
30f4064cfb Bind VariableFunctions as a module, not a class with static methods. (#38136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38136

This was a bit trickier than I expected, because modules have
to be importable to be pickleable, but adding a module to another
module in the C API isn't really the right way to make it importable.
We hack around it by manually adding the module to sys.modules.

Thanks Richard Zou for an extremely useful prior attempt which helped
me make this work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21487840

Pulled By: ezyang

fbshipit-source-id: 368da9b9c50e5de4d7dd265e6f9f189a882d75c1
2020-05-08 22:34:34 -07:00
7e9af67ca1 Add minimal skeleton for _C type stubs, delete torch.autograd stub (#38080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38080

Originally, my plan was to just delete the torch.autograd stub, but
this triggered a bunch of downstream errors relating to non-existent
to _C modules, and so instead of ignoring those files, I decided to
add a minimal _C type stubs, where it was easy (cases which were
codegened I ignored).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21487841

Pulled By: ezyang

fbshipit-source-id: cfcc467ff1c146d242cb9ff33a46ba26b33b8213
2020-05-08 22:33:21 -07:00
464e5a6c07 [TensorExpr] Add print functions for Tensor and Function. (#38175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38175

Also, make Tensor derived from KernelScopedObject - we must have missed
that originally.

Test Plan: Imported from OSS

Reviewed By: resistor

Differential Revision: D21489136

Pulled By: ZolotukhinM

fbshipit-source-id: fe003f44ef1265629fd84befc2e9ec8f48d2fc4f
2020-05-08 22:15:26 -07:00
8181711637 Automatic update of fbcode/onnx to 79a7e0df7e86e0f32e7a05f563b24a566540c18b (#38106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38106

Previous import was 807c62cf7e4c96ce49040bcf073b7e4a054f28a5

Included changes:
- **[79a7e0df](https://github.com/onnx/onnx/commit/79a7e0df)**: Fix copy paste error of Min op test case (#2640) <Takeshi Watanabe>
- **[4cd2538d](https://github.com/onnx/onnx/commit/4cd2538d)**: Add a release pipeline for Windows python packages (#2632) <Changming Sun>
- **[e8b33a5a](https://github.com/onnx/onnx/commit/e8b33a5a)**: Adding UnfoldToDepth op [1.7 Release] (#2616) <Negin Raoof>
- **[c2a8d525](https://github.com/onnx/onnx/commit/c2a8d525)**: update docs (#2627) <Ksenija Stanojevic>
- **[22752354](https://github.com/onnx/onnx/commit/22752354)**: Generate tests for MeanSquareDistance and SoftmaxCrossEntropyLoss (#2623) <Jonny Shipton>
- **[602bd622](https://github.com/onnx/onnx/commit/602bd622)**: Add section about external tensor data to IR.md (#2323) <Jonny Shipton>
- **[165c3f3b](https://github.com/onnx/onnx/commit/165c3f3b)**: Add integer support to Clip (#2532) <Jonny Shipton>
- **[a5fabf87](https://github.com/onnx/onnx/commit/a5fabf87)**: Add operators LessOrEqual and GreaterOrEqual (as functions) (#2606) <Jeremy Cochoy>
- **[f1dcdafc](https://github.com/onnx/onnx/commit/f1dcdafc)**: Fix input document of quantized operators (#2117) <Takeshi Watanabe>
- **[43af9b69](https://github.com/onnx/onnx/commit/43af9b69)**: Add reference impl for sequence ops (#2380) <Bowen Bao>
- **[aa50aa12](https://github.com/onnx/onnx/commit/aa50aa12)**: Print value case of TypeProto more friendly (#2422) <Takeshi Watanabe>
- **[2e67bfc3](https://github.com/onnx/onnx/commit/2e67bfc3)**: Fix issue #2436 (#2447) <daquexian>
- **[d27ffc6b](https://github.com/onnx/onnx/commit/d27ffc6b)**: Add support for integer tensors to Min and Max (#2608) <Jonny Shipton>
- **[5cc668af](https://github.com/onnx/onnx/commit/5cc668af)**: Update IR.md to describe training extension. (#2615) <G. Ramalingam>
- **[8c5bf9d4](https://github.com/onnx/onnx/commit/8c5bf9d4)**: Generate node backend tests for celu operator (#2607) <Jonny Shipton>
- **[7b65287e](https://github.com/onnx/onnx/commit/7b65287e)**: Change dtype of dd_da in gradient test to float32 (#2620) <Shinichiro Hamaji>
- **[e91739f2](https://github.com/onnx/onnx/commit/e91739f2)**: Introduce SoftmaxCrossentropy as a loss function (#2573) <Ksenija Stanojevic>
- **[b008ed3a](https://github.com/onnx/onnx/commit/b008ed3a)**: Support gathernd with batch_dim mode (#2585) <wezuo>
- **[d2fe4f22](https://github.com/onnx/onnx/commit/d2fe4f22)**: Introduce MeanSquaredError as Loss Function (#2570) <Ksenija Stanojevic>
- **[10b812a6](https://github.com/onnx/onnx/commit/10b812a6)**: Add support for default attributes within FunctionExpandHelper (#2588) <Ewa Tusień>
- **[3368834c](https://github.com/onnx/onnx/commit/3368834c)**: adding version update content. (#2609) <Ke Zhang>
- **[8873cb02](https://github.com/onnx/onnx/commit/8873cb02)**: Adding Inverse Op (#2578) <Negin Raoof>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D21471424

fbshipit-source-id: 5009a5f9558458a0aba56b2a9e8fffc3895a9e02
2020-05-08 21:47:11 -07:00
3d0279862d Consolidate builtin/python_udf RPC to return ivalue::Future like torchscript RPC does (#35154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35154

This is for issue https://github.com/pytorch/pytorch/issues/34999.

Closes https://github.com/pytorch/pytorch/issues/34999.

https://github.com/pytorch/pytorch/issues/34997 needs more work.

This will make a few work items easier, like 1) Dist autograd profiler, 2) JIT annotation for Future.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_rref_forward_chain --stress-runs 100

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_call_method_on_rref
```

Also:
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- 'test_rref_proxy_class \(fb\.test_rpc_fork\.RpcTestWithFork\)' --stress-runs 100

Additional tests: test_rref_proxy_reuse, test_handle_send_exceptions

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_script_call_python_return_future
```

Differential Revision: D7722184

fbshipit-source-id: bd92b855bfea4913d6672700590c57622fa86e0e
2020-05-08 21:28:56 -07:00
86d28706e0 Remove uses of type() part 2 (#38140)
Summary:
I'm mostly done with cleaning up the test/ folder. There are a bunch of remaining callsites, but they're "valid" in that they test `type()` functionality itself; we cannot remove them until it's fully deprecated.
The next PR will mainly focus on moving some callsites to an internal API. (For reference, the modern replacements are sketched below.)
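
For illustration, the modern replacements for the deprecated calls (a sketch, not taken from this PR):

```
import torch

x = torch.randn(3)
y = torch.randn(3, dtype=torch.float64)

x.to(torch.float64)  # instead of x.type(torch.DoubleTensor)
x.to(y)              # instead of x.type_as(y); matches y's dtype and device
```
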
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38140

Differential Revision: D21483808

Pulled By: ailzhang

fbshipit-source-id: 12f5de6151bae59374cfa0372e827651de7e1c0f
2020-05-08 19:30:46 -07:00
16e62f9305 Unboxing uses if_constexpr instead of SFINAE (#38145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38145

Now that if_constexpr is landed, we can make this more readable
ghstack-source-id: 103765920

Test Plan: waitforsandcastle

Differential Revision: D21480798

fbshipit-source-id: 8181d4731036373cc3a1868fd6f4baeebb426081
2020-05-08 18:15:44 -07:00
ae534dc978 [TorchScript] Explicitly disallow del with more than 1 operand. (#38089)
Summary:
del in Python supports multiple operands, but the PyTorch C++ frontend doesn't. To be consistent across frontends, we decided to throw an exception when del appears with more than one operand inside TorchScript.
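
A sketch of the behavior difference (hypothetical function; the real tests live in test/jit/test_builtins.py):

```
import torch

def f(x: int, y: int) -> int:
    a, b = x, y
    del a, b  # legal in eager Python: del accepts multiple operands
    return x + y

# Per this change, scripting the same function raises an exception;
# each operand needs its own del statement instead.
try:
    torch.jit.script(f)
except Exception as e:
    print("multi-operand del rejected:", e)
```
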
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38089

Test Plan: Unit tests in test/jit/test_builtins.py

Differential Revision: D21478900

Pulled By: SplitInfinity

fbshipit-source-id: 1cbd61301680c5d6652ef104996178cefcdd3716
2020-05-08 17:56:36 -07:00
138476389e [quant] Disable qnnpack test when TSAN is enabled (#38153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38153

Test fails when opt-tsan is enabled

Test Plan: buck test mode/opt-tsan //caffe2/test:quantization -- 'test_single_linear_dynamic \(quantization\.test_quantize\.TestGraphModePostTrainingStatic\)' --run-disabled

Reviewed By: vkuzo

Differential Revision: D21482799

fbshipit-source-id: fe6d1d84f525387081fabb90ce876c7c7dafd081
2020-05-08 16:52:36 -07:00
63b1ae6983 Fix overflow in torch.remainder when dividend is very large (#37758)
Summary:
This will fix the GPU implementation in https://github.com/pytorch/pytorch/issues/37743 and https://github.com/pytorch/pytorch/issues/24861. Please also check my [comment](https://github.com/pytorch/pytorch/issues/37743#issuecomment-623285707).

The fixed `remainder_kernel` follows the similar implementation in numpy. See 79d7bc276a/numpy/core/src/npymath/npy_math_internal.h.src (L649-L658)

I also slightly updated the doc for `torch.remainder` to make it similar to `torch.fmod`.

I'm not sure how to modify the Vec256 code of the CPU remainder_kernel, so I left it as is.
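
For reference, a Python transcription of the NumPy-style kernel linked above (a sketch of the algorithm, not the actual CUDA code):

```
import math

def remainder(a: float, b: float) -> float:
    # fmod plus a sign correction, mirroring npy_math_internal.h; this
    # avoids the overflow-prone a - floor(a / b) * b formulation when
    # the dividend is very large.
    mod = math.fmod(a, b)
    if mod != 0.0 and (b < 0.0) != (mod < 0.0):
        mod += b
    return mod

assert remainder(-5.0, 3.0) == 1.0  # matches Python's -5.0 % 3.0
```
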
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37758

Differential Revision: D21388417

Pulled By: ngimel

fbshipit-source-id: 770ba5801cf34619b2b68b8b0cf95d8cfa52e6f6
2020-05-08 16:46:55 -07:00
fdc40616b2 s/callUnboxed/call/ (#37999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37999

Next step: make explicit type arguments less intrusive, or find
a way to eliminate them entirely.

Test Plan: Imported from OSS

Differential Revision: D21445646

Pulled By: bhosmer

fbshipit-source-id: 106b3381acea473ca686ab42b5ca610c89f5c531
2020-05-08 16:18:10 -07:00
55de7c3bb0 Add test jobs on CPU agents for CUDA builds on Windows (#37904)
Summary:
Targets https://github.com/pytorch/pytorch/pull/37811#issuecomment-624367089.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37904

Differential Revision: D21484360

Pulled By: seemethere

fbshipit-source-id: b25cbf35b8432a587bce86815c97ff444cab255c
2020-05-08 15:46:09 -07:00
e84aa0211d [JIT]Support List variable in adv indexing. (#37966)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/37848. I realized that it's better to condition on the `Value` type instead of the token type, so indexing now also works through list variables (it used to be list literals only).
Also, apparently our eager frontend accepts indexing with a float list as well, so we matched that edge-case behavior too.
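
A sketch of what now compiles (hypothetical function):

```
from typing import List

import torch

@torch.jit.script
def take(x: torch.Tensor, idx: List[int]) -> torch.Tensor:
    # Advanced indexing through a list *variable*; previously only a
    # list literal like x[[0, 2]] was supported.
    return x[idx]
```
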
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37966

Reviewed By: suo

Differential Revision: D21439642

Pulled By: ailzhang

fbshipit-source-id: cedb8431ef38747d4aa9909a6bbf8e954dbe0e25
2020-05-08 15:40:11 -07:00
c879c6fb98 Vectorize non-persistent Softmax kernels (#36485)
Summary:
Add read/write vectorization to non-persistent softmax kernels only. At this point the launch logic has minimal changes, and `ILP=vectorization=2` is always used (the code can handle other values, but `ILP=2` has been the most consistent performer).

Dispatch to persistent / non-persistent kernels is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36485

Differential Revision: D21477775

Pulled By: ngimel

fbshipit-source-id: 9ff7fd243695d7bbf4121390085b64db0bbdef35
2020-05-08 15:20:33 -07:00
615235fc80 Migrate OwnerRRef value store to generic torch Future (#38143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38143

It's a follow-up of https://github.com/pytorch/pytorch/pull/32556, where an error-handling boilerplate code path was added to the FutureMessage callback.

However, I noticed that the FutureMessage could never be set with an error, because the FutureMessage is a member of OwnerRRef:

- OwnerRRef does not have a setError method yet.
- The FutureMessage is only used for signaling
- The value of the RRef is contained in the `value_` field.

With the Future being generalized, it could contain more value types, not limited to Message.

This PR migrates the OwnerRRef value from the `value_` field to the generic Future.

In a later PR, it will be super easy to add a `setError` method for OwnerRRef, which calls `future_.setError(..)`. (I decided to do it later; I think it's better to migrate the call sites together with adding the new `setError` method.)

Also, this fixes the issue pointed out by https://github.com/pytorch/pytorch/pull/31086/files#r422256916.

This PR was submitted as https://github.com/pytorch/pytorch/pull/32608.
ghstack-source-id: 103757743

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_call_method_on_rref
```

Differential Revision: D5707692

fbshipit-source-id: 83ce0e5e5e97acb9ce8230fce5e4a3d806478b02
2020-05-08 15:10:32 -07:00
ca2206d071 Add documentation for FeatureAlphaDropout (#36295)
Summary:
These changes add documentation for FeatureAlphaDropout, based on a need raised in an issue by SsnL (Issue https://github.com/pytorch/pytorch/issues/9886).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36295

Differential Revision: D21478591

Pulled By: zou3519

fbshipit-source-id: a73c40bf1c7e3b1f301dc3347cef7b32e9842320
2020-05-08 15:09:01 -07:00
c13dc2cab2 Fix a minor typo in DistanceOpsKernel.cpp (#37596)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37596

Differential Revision: D21356627

Pulled By: zou3519

fbshipit-source-id: a3e7bc6f9f150d478c6d10fb5b6797589af12add
2020-05-08 15:04:30 -07:00
ad433e2003 [TensorExpr] Fix a bug in the IR Simplifier that could introduce a division by zero (#38055)
Summary:
In the IR Simplifier, when doing partial factorization of Round+Mod patterns, we divide by the lower number, which could be zero. Add a quick check against zero to avoid the crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38055

Differential Revision: D21478486

Pulled By: nickgg

fbshipit-source-id: c5083f672e91662b7d1271d817cade7fa6c39967
2020-05-08 14:58:53 -07:00
9957db22a9 int8 fc with tests (#38017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38017

Added a more comprehensive set of tests for int8 FC. There are some
failures, but this emulation gets us much closer than the existing
one. There is still more work coming.

Test Plan: the test itself

Reviewed By: amylittleyang

Differential Revision: D21368530

fbshipit-source-id: 318722c030b2a1f8de37adb7c8633f75057edfab
2020-05-08 14:52:51 -07:00
172bcdb8c8 Add documentation for nn.Hardsigmoid and nn.functional.hardsigmoid. (#38120)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38120

Test Plan: build docs locally and attach a screenshot to this PR.

Differential Revision: D21477815

Pulled By: zou3519

fbshipit-source-id: 420bbcfcbd191d1a8e33cdf4a90c95bf00a5d226
2020-05-08 13:56:45 -07:00
41572116f6 Dont store redundant packed params in dynamic quantized RNN (#38134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38134

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D21479289

Pulled By: jamesr66a

fbshipit-source-id: 11d9ad034396ce75c5a93d1f7ebca587205089ee
2020-05-08 13:52:52 -07:00
4784af1d78 [TensorExpr] Don't include aten::rand_like to TE fusion groups since we can't handle rand+broadcast case yet. (#38132)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38132

Test Plan: Imported from OSS

Reviewed By: resistor

Differential Revision: D21479256

Pulled By: ZolotukhinM

fbshipit-source-id: 2678cfd6ad2feea132efb5eec09e5f41bbd54487
2020-05-08 13:37:13 -07:00
6e1e2a60dc fix compilation error with gcc 5.5 (#38112)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/38111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38112

Differential Revision: D21476876

Pulled By: malfet

fbshipit-source-id: 06d25e763eb73961f7b4e4cfbd2bb59f5ab96387
2020-05-08 13:23:36 -07:00
a7c29dbfa2 unfold_backward gets its own kernel (#36612)
Summary:
`unfold_backward` uses `index_add`, which causes a regression on CUDA because of the underlying `atomicAdd`, and a regression on CPU because of limited parallelization. This PR attempts to replace `index_add` with a custom kernel.

Fixes [https://github.com/pytorch/pytorch/issues/17501](https://github.com/pytorch/pytorch/issues/17501).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36612

Differential Revision: D21450349

Pulled By: albanD

fbshipit-source-id: 09ec1fbd5d7290656700eca8e7fb7cf52323ec28
2020-05-08 13:18:36 -07:00
0ed7fc581c [quant][graphmode][refactor] Split quantization.cpp (#37975)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37975

Test Plan:
.

Imported from OSS

Differential Revision: D21468497

fbshipit-source-id: 35cbf98a344ca6e4094d616a4040eacf017fd2de
2020-05-08 12:24:50 -07:00
ff9a809ccd [quant][graphmode][refactor] Remove unused code in quantization.cpp (#37974)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37974

Differential Revision: D21468498

Pulled By: jerryzh168

fbshipit-source-id: 96f34db9f98474ec8e5d33e9b7c406b1637f5de8
2020-05-08 11:03:03 -07:00
c1e7758b5e Back out "Revert D20229168: [quantization] Use torchbind for Linear PackedParams" (#38101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38101

Original commit changeset: 29e8a4d3b8bf
ghstack-source-id: 103730417

Test Plan: waitforsadcastle

Differential Revision: D21471381

fbshipit-source-id: a922cdf31ba32021e7264ae1454c646c0bfd7ef4
2020-05-08 10:53:06 -07:00
91f451a5e6 [TensorPipe] Do not require user to provide worker name-to-rank map (#38052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38052

The initial version of the TensorPipe agent required the user to specify, on each worker, the full map between workers' names and their ids. However, it's enough for each worker to specify just its own name and id, as these can then be exchanged using the store.

Addresses #37784, although I think we can go further and use the store to also automatically assign ranks to workers, so that the user only needs to specify a name.
ghstack-source-id: 103741595

(Note: this ignores all push blocking failures!)

Test Plan:
On worker 0:
```
In [1]: import os
   ...: import torch
   ...: import torch.distributed.rpc as rpc
   ...: os.environ["MASTER_ADDR"] = "127.0.0.1"
   ...: os.environ["MASTER_PORT"] = "8765"

In [2]: rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)

In [3]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[3]:
tensor([[3., 3.],
        [3., 3.]])

In [4]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[4]: 3
```
On worker 1:
```
In [1]: import os
   ...: import torch
   ...: import torch.distributed.rpc as rpc
   ...: os.environ["MASTER_ADDR"] = "127.0.0.1"
   ...: os.environ["MASTER_PORT"] = "8765"

In [2]: rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
```

Then also tested by adding `rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method="file:///tmp/init/foo")` to `rpc_init`.

Differential Revision: D21463833

fbshipit-source-id: b53d7af6fc060789358ac845aa1898ddea6e8f31
2020-05-08 10:48:48 -07:00
b4946b96c6 Don't use Profiler key in lite interpreter (#37962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37962

Temporarily re-enable RecordFunction in the lite interpreter when the profiler key is not set; this allows the profiler to work without profiled wrappers in the build.

Test Plan: CI

Reviewed By: smessmer, linbinyu

Differential Revision: D21409120

fbshipit-source-id: 6f0311c8eb55537a03b8bdac69def18a496ec672
2020-05-08 10:47:10 -07:00
726aa713d5 Replace torch.is_tensor usages with isinstance checks. (#38062)
Summary:
`is_tensor` doesn't really have a reason to exist anymore (other than
backwards compatibility) and is worse for typechecking with mypy (see
gh-32824). Given that it may not be obvious what the fix is once mypy
gives an error, make the change in a number of places at once, and add
a note on this to the `is_tensor` docstring.

Recommending an isinstance check instead has been done for quite a
while, e.g. https://github.com/pytorch/pytorch/pull/7769#discussion_r190458971
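
For reference, the recommended replacement pattern:

```
import torch

def maybe_sum(x):
    # isinstance lets mypy narrow the type of x; torch.is_tensor does not.
    if isinstance(x, torch.Tensor):
        return x.sum()
    return x
```
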
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38062

Differential Revision: D21470963

Pulled By: ezyang

fbshipit-source-id: 98dd60d32ca0650abd2de21910b541d32b0eea41
2020-05-08 10:10:11 -07:00
9232356e5f remove uses of type() and type_as() part 1. (#38029)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38029

Differential Revision: D21468523

Pulled By: ailzhang

fbshipit-source-id: 14b7185d43eb03f630cfaa2d70e02d637ff8551b
2020-05-08 08:16:24 -07:00
0c936f94d6 Revert D21449612: [pytorch][PR] Migrate AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3 to c10::complex
Test Plan: revert-hammer

Differential Revision:
D21449612

Original commit changeset: 236070946b9d

fbshipit-source-id: 2de485ca18388a055f44d6caf18cf516b2288875
2020-05-08 02:34:00 -07:00
0f60c8d878 [TensorExpr] Correctly print 'bool' dtype in Cuda printer. (#38077)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38077

Test Plan: Imported from OSS

Differential Revision: D21467298

Pulled By: ZolotukhinM

fbshipit-source-id: 65ac347f097e01aaf1d3ff5d598a402ca619d1f2
2020-05-08 00:40:47 -07:00
ff1a627bae [TensorExpr] Don't include prim::Constant nodes with Tensor type into TE fusion groups - we can't handle them. (#38105)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38105

Test Plan: Imported from OSS

Differential Revision: D21471611

Pulled By: ZolotukhinM

fbshipit-source-id: 5d06fde353e221bcbdf26935a19b589aab7e2afe
2020-05-08 00:40:42 -07:00
a253ea92fb [TensorExpr] Properly handle Bool dtype in several other places. (#38104)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38104

Test Plan: Imported from OSS

Differential Revision: D21471612

Pulled By: ZolotukhinM

fbshipit-source-id: b582b9fda346b96df5d19bf8a160ab2b5306cb92
2020-05-08 00:39:12 -07:00
459f14e9f6 [TensorExpr] Correctly print dtypes in Cast and Allocate. (#38091)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38091

Test Plan: Imported from OSS

Differential Revision: D21469100

Pulled By: ZolotukhinM

fbshipit-source-id: d052ff33f26321d04371557c4cf2afc0a928c6bf
2020-05-08 00:27:23 -07:00
609d5a4476 [tensorboard] Let hparam render values correctly (#31544)
Summary:
The root cause of the incorrect rendering is that numbers are treated as strings if the data type is not specified. Therefore the data is sorted based on the first digit.
closes https://github.com/pytorch/pytorch/issues/29906
 cc orionr sanekmelnikov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31544

Differential Revision: D21105403

Pulled By: natalialunova

fbshipit-source-id: a676ff5ab94c5bdb653615d43219604e54747e56
2020-05-08 00:05:16 -07:00
4c358b8b72 Run QEMU to test that default dispatch doesn't use AVX (#38094)
Summary:
`qemu-x86_64 -cpu Haswell` JIT-compiles x86_64 code for the host OS but lacks support for AVX/AVX2 instruction set emulation, which makes it an ideal target for testing instruction set violations (especially via static initializers), even when it runs on a CPU physically capable of executing AVX2 instructions.
It's quite easy to validate that this is the case by invoking ATen's `basic` cpp test with dispatch set to AVX: `qemu-x86_64 -cpu Broadwell -E ATEN_CPU_CAPABILITY=avx ./bin/basic --gtest_filter=BasicTest.BasicTestCPU`

This PR adds an extra step to the CircleCI test suite that executes the `basic` test with the default CPU capability for `pytorch-linux-[xenial|bionic]-py3.6-...-test` configurations using QEMU and validates that it completes successfully. (It fails before https://github.com/pytorch/pytorch/pull/38088 is merged.)

Closes https://github.com/pytorch/pytorch/issues/37786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38094

Differential Revision: D21472278

Pulled By: malfet

fbshipit-source-id: 722d4eceac8ce6fbc336ab883819cf7fccea3a66
2020-05-07 22:09:57 -07:00
53aa7d8bc5 Add option to skip tests after retries (#38079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38079

Differential Revision: D21470238

Pulled By: malfet

fbshipit-source-id: b2e63be34090c6f61acad8b6530658a835c68870
2020-05-07 21:56:29 -07:00
d35ab0b7ae Fix CUDA memory management issues caused by not using PinnedCPUAllocator (#38066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38066

Increasing priority for PinnedCPUAllocator to make sure it is set when CUDA is enabled.

Test Plan: buck test mode/dev-nosan //vision/fair/detectron2/tests:test_export_caffe2 -- 'testMaskRCNNGPU \(test_export_caffe2\.TestCaffe2Export\)'

Reviewed By: ppwwyyxx

Differential Revision: D21465835

fbshipit-source-id: 643cff30d35c174085e5fde5197ddb05885b2e99
2020-05-07 21:52:00 -07:00
f4d9713d12 Migrate AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3 to c10::complex (#37977)
Summary:
`AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX_AND3` is removed
`AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3` is now using `c10::complex`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37977

Differential Revision: D21449612

Pulled By: anjali411

fbshipit-source-id: 236070946b9d6fc89533d196f17fa9c7275d83b5
2020-05-07 21:47:41 -07:00
deeef50432 Check the _geev input matrix for NaNs and infs (#37642)
Summary:
If we don't do this we risk a segmentation fault from the Intel MKL.
Fixes https://github.com/pytorch/pytorch/issues/37499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37642

Differential Revision: D21465181

Pulled By: pbelevich

fbshipit-source-id: 809dca11f11de91018d978578bc11737b879d6ec
2020-05-07 21:33:37 -07:00
16e3df3ac6 Fix typo: TupleUnpack. (#38043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38043

Fixes https://github.com/pytorch/pytorch/issues/37183

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21471000

Pulled By: ezyang

fbshipit-source-id: feea7021a23a68053db32ef02751e97e1b61ca8f
2020-05-07 20:57:20 -07:00
5a386a0a78 Fix ldflags string for HIPExtensions (#38047)
Summary:
This pull request adds a check for the ROCm environment and skips adding CUDA-specific flags when a PyTorch extension is built on ROCm.

ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38047

Differential Revision: D21470507

Pulled By: ezyang

fbshipit-source-id: 5af2d7235e306c7aa9a5f7fc8760025417383069
2020-05-07 20:39:01 -07:00
c2f787ce77 Give _VariableFunctions class a different name, so pickling works (#38033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38033

Pickles require class names to be actually accessible from the module
in question.  _VariableFunction was not!  This fixes it.
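
The underlying pickle rule, in miniature (a generic sketch, unrelated to the torch internals):

```
import pickle

class Visible:
    pass

# pickle stores "<module>.<qualname>" and re-imports it on load, so the
# recorded name must resolve back to the same class object:
assert type(pickle.loads(pickle.dumps(Visible()))) is Visible

# Rebinding the class under a different name breaks that lookup:
Renamed = Visible
del Visible
# pickle.dumps(Renamed()) would now raise PicklingError, since "Visible"
# is no longer an attribute of this module.
```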

Fixes https://github.com/pytorch/pytorch/issues/37703

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21458068

Pulled By: ezyang

fbshipit-source-id: 2a5ac41f9d1972e300724981b9b4b84364ddc18c
2020-05-07 20:34:21 -07:00
9fe8243536 Fix minor issue in type stub for Optimizer (#38067)
Summary:
Closes gh-23731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38067

Differential Revision: D21471021

Pulled By: ezyang

fbshipit-source-id: 8e7ee7f437bfa8e78a47ac6cf572b0fc9b5c6939
2020-05-07 20:11:40 -07:00
3cade9cdd4 Automatic update of fbcode/onnx to 807c62cf7e4c96ce49040bcf073b7e4a054f28a5 (#37983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37983

Previous import was 9fdae4c68960a2d44cd1cc871c74a6a9d469fa1f

Included changes:
- **[807c62cf](https://github.com/onnx/onnx/commit/807c62cf)**: Training Proposal: Spec Changes and Gradient Operator (#2314) <Wei-Sheng Chin>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D21441188

fbshipit-source-id: 88b5be5bd479b59bdb45525f5dfe61d787151cdd
2020-05-07 20:07:31 -07:00
12bbda053c Remove static initalizers from Vec256 (#38088)
Summary:
Follow up after PR https://github.com/pytorch/pytorch/pull/37767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38088

Test Plan: `qemu-x86_64 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU`

Differential Revision: D21468726

Pulled By: malfet

fbshipit-source-id: dcf9ce3d7816e9bdfe5d0fcf8b7cad42c0f77b4c
2020-05-07 20:03:30 -07:00
25413635d0 [c2][opt] nomnigraph transform for ClipRangesGatherSigridHashV2 fusion (#38004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38004

Un-backout of D21353550, originally D21262085. No changes here, fix in D21445881.

Fuse ClipRanges + GatherRanges + SigridHash -> ClipRangesGatherSigridHashV2

The dpa_product_ctr model's dper2 to dper3 migration is blocked by 3.6% higher prospector CPU usage. The root cause was traced down to the sigrid transforms, where ClipRanges, GatherRanges, and SigridHash are called separately instead of fused, as is the case in dper2.

Further context:
https://fb.quip.com/GijaAZtX5mav
https://fb.quip.com/pIDdAjJP2uiG

Test Plan:
Local benchmarking with small model 181513584_0
(Dper3 full model is 178772812, dper2 refresh is 178770392)

Transform turned on: P129799373
Iters per second: 609.291

Transform turned off: P129799397
Iters per second: 519.088

We also want to confirm this performance on the full model in canary and in qrt.

`buck build mode/opt-clang mode/no-gpu caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench`

`MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --pred_net=/data/users/ansha/tmp/dpa/small_pred_net.pb --c2_model=/data/users/ansha/tmp/dpa/181513584_0.predictor --c2_inputs=/data/users/ansha/tmp/dpa/c2_inputs_small.pb --iters=3000 --warmup_iters=100 --num_threads=32 --c2_apply_nomnigraph_passes=1 --caffe2_predictor_enable_preproc_fusion=1`

Run dbgo build to check that all transforms happen.

Check that ClipRangesGatherSigridHash is used: https://fburl.com/scuba/caffe2_operator_stats_canary/e6qfdsat

Canaries:
https://our.intern.facebook.com/intern/ads/canary/426498918895712377/
https://our.intern.facebook.com/intern/ads/canary/426498905389730718/
https://our.intern.facebook.com/intern/ads/canary/426498901795492517/

Dbgo canaries:
https://our.intern.facebook.com/intern/ads/canary/426498888067456166/
https://our.intern.facebook.com/intern/ads/canary/426498879652089095/
https://our.intern.facebook.com/intern/ads/canary/426498873491575187/
https://our.intern.facebook.com/intern/ads/canary/426498860171351505/

Reviewed By: houseroad

Differential Revision: D21445887

fbshipit-source-id: a3c15ee30465de693f434b6ee041025c276581ac
2020-05-07 20:00:35 -07:00
32329c3338 [nomni] fix outputs check to replaceSubgraph (#38005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38005

D21445887 runs into a dbgo build crash on this stack P130135519

It is because the assertion sg_inputs_copy.size() == 0 is too restrictive:
nn::getOutputs(sg) returns "output" nodes, which can include any inputs
that have additional consumers outside the subgraph itself.
To fix this, we propose removing inputs from the output check.

Test Plan:
Run tests

Sanity canaries:
https://our.intern.facebook.com/intern/ads/canary/426498931666198610/
https://our.intern.facebook.com/intern/ads/canary/426498935267166205/

Reviewed By: bwasti

Differential Revision: D21445881

fbshipit-source-id: 419a4b1a230f0370619cea574403bfa114e56a7c
2020-05-07 19:58:15 -07:00
f8c93c5d3e Get rid of javasphinx dependency. (#38042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38042

Fixes https://github.com/pytorch/pytorch/issues/36064

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21460484

Pulled By: ezyang

fbshipit-source-id: 553cbacc4365cfd84ff4a468a7366b12eade6fe0
2020-05-07 19:52:31 -07:00
4bc0a7f86a Revert D20229168: [quantization] Use torchbind for Linear PackedParams
Test Plan: revert-hammer

Differential Revision:
D20229168

Original commit changeset: 3607cac9aa5b

fbshipit-source-id: 29e8a4d3b8bffd95ff6a58b46c4f1c1e23770304
2020-05-07 19:47:45 -07:00
29f19bf727 [ONNX] Enable tests for opset 12 (#37846)
Summary:
Update ORT nightly version and enable opset 12 tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37846

Reviewed By: hl475

Differential Revision: D21467903

Pulled By: houseroad

fbshipit-source-id: 20d249790edfb0091a02ebfc58c3d306087e8471
2020-05-07 19:39:08 -07:00
5ee2302349 Add links to more subdir READMEs in CONTRIBUTING.md (#38049)
Summary:
I think it would be nice to have these extra README links here so they're easier to find. There are even more READMEs throughout the source tree that I didn't include, but most of them seem to have pretty minimal information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38049

Differential Revision: D21470749

Pulled By: ezyang

fbshipit-source-id: aa164a3776ab90f2453634082eeae20c0dd002ce
2020-05-07 19:37:33 -07:00
9efbc19f75 Fix the issue with C2 cont build
Summary: The issue was introduced in D21258652. We need to make sure it compiles in opt mode. We may still have some leftover py2 packages, so let's just use a format that works with both.

Test Plan: ci

Reviewed By: xush6528

Differential Revision: D21457394

fbshipit-source-id: cde79a0fc6b4feba307bd9d45e1a1d4a42de9263
2020-05-07 19:33:00 -07:00
4ae187f6cb Set SCCACHE_IDLE_TIMEOUT to INFINITE(0) on Windows (#37993)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37993

Differential Revision: D21470487

Pulled By: ezyang

fbshipit-source-id: 8a16e65c539439ff4ca0cba4a3b0bf144b8d85c9
2020-05-07 19:27:45 -07:00
bfa5070cbc Fix rebuild with Ninja on Windows (#37917)
Summary:
It is currently broken due to a ninja bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37917

Differential Revision: D21470357

Pulled By: ezyang

fbshipit-source-id: c0ed858c63a7504bf2c4961dd7ed906fc3f4502a
2020-05-07 19:15:27 -07:00
eaf9b28c55 [quantization] Use torchbind for Linear PackedParams (#34140)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34140

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D20229168

Pulled By: jamesr66a

fbshipit-source-id: 3607cac9aa5b4b044572329742baed03350491c6
2020-05-07 19:03:44 -07:00
e3fcc6ade8 Skip RPC profiling tests (#38045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38045

We are working on fixing these (e.g.
https://github.com/pytorch/pytorch/pull/37311), but a few PRs still need to land
before these tests are fixed. Disable them for now to avoid noise.
ghstack-source-id: 103701518

Test Plan: CI

Differential Revision: D21461340

fbshipit-source-id: fbb029a19a93d439c9fce8424be0fb6409b52ff3
2020-05-07 18:47:06 -07:00
d5df055bbb [WIP][JIT] Add JIT backend registration API (#35833)
Summary:
**Summary**
This commit adds `torch::jit::RegisterBackend`, an API that allows
external backends to be registered for the execution of JIT subgraphs
outside the JIT interpreter. In order to register an external backend,
one must extend the provided abstract class `PyTorchBackendInterface` and provide
two additional functions: one that creates an instance of the aforementioned subclass
of `PyTorchBackendInterface`, and another that preprocesses a `ScriptModule` so that
it can run on the backend. Then, a `ScriptModule` that can compile and execute a given
JIT subgraph using the functions provided at registration time is generated
for each registered backend.

**Testing**
This commit adds a unit test that uses a minimal test backend
to make sure that the registration endpoint and generated
`ScriptModule` work.

```
$ python test/test_jit.py TestBackends
Fail to import hypothesis in common_utils, tests are not derandomized
.
----------------------------------------------------------------------
Ran 1 test in 0.183s

OK

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35833

Differential Revision: D21231955

Pulled By: SplitInfinity

fbshipit-source-id: 452db1123d0e5d83f97fe5da8a00fdfdb50dbef9
2020-05-07 18:15:26 -07:00
002f5ec51b Add preprocessing that fuses decomposed linear into linear. (#37937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37937

Sometimes traced models don't preserve aten::linear ops and they are decomposed
into addmm or mul + add. Adding this preprocessing step helps us catch more
lowerable linear nodes.
Please see the test for an example.
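
For context, the decomposed form the pass now recognizes (a sketch):

```
import torch

x = torch.randn(2, 3)
W = torch.randn(4, 3)
b = torch.randn(4)

# Traces sometimes record linear as addmm(bias, input, weight.t());
# the preprocessing step fuses this back into a lowerable aten::linear.
decomposed = torch.addmm(b, x, W.t())
assert torch.allclose(decomposed, torch.nn.functional.linear(x, W, b))
```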

Test Plan: python test/test_xnnpack_integration.py

Reviewed By: xcheng16

Differential Revision: D21428069

fbshipit-source-id: 6c4ea3335eaf5722852c639fb4ee593746bb408f
2020-05-07 18:08:36 -07:00
376c9a40dc Fix dummy typo in skipIfNoFBGEMM (#38058)
Summary:
I picked the wrong revision when landing the diff; it should have had an actual check rather than `if True`:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38058

Differential Revision: D21466152

Pulled By: malfet

fbshipit-source-id: 03fdc510562fab44b7d64a42284d4c3c1f8e940a
2020-05-07 18:03:48 -07:00
a42616f71a Fix torch.tensor dtype inference (#38030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38030

Resolves: https://github.com/pytorch/pytorch/issues/36834

Test Plan: Imported from OSS

Differential Revision: D21462729

Pulled By: anjali411

fbshipit-source-id: 456b01e96fc3eac0ddf572703636459e05649316
2020-05-07 17:41:08 -07:00
f2f8027760 [TensorExpr] simplify trivial adds/subs/muls even in Float (#37960)
Summary:
The IR Simplifier exits early when working with dtypes that are not safe to reorder. There are some cases where we still want to simplify ops in these dtypes: x + 0, x - 0, x * 0 and x * 1. It's safe to eliminate the op here, and it reduces clutter in the expr.

Also added a quick simplification of casts which do nothing (their type is the same as the underlying type).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37960

Differential Revision: D21457736

Pulled By: nickgg

fbshipit-source-id: 40e20a3b55fc1afb2ec50071812238a08bded2ac
2020-05-07 17:23:47 -07:00
379e717a1b Back out "Revert D18927220: if_constexpr for C++14" (#37792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37792

Original commit changeset: a1b8755a2790
ghstack-source-id: 103609715

Test Plan: waitforsandcastle

Differential Revision: D21389755

fbshipit-source-id: 1a3c74295dbfbf07fe225be9bcd47d11e31a20fa
2020-05-07 15:20:55 -07:00
5e83a13e14 stop creating integer type Tensors that require gradients (#37789)
Summary:
Fix https://github.com/pytorch/pytorch/issues/37680

Makes two changes:
- Add `argmin`, `argmax` and `argsort` to the list of non-differentiable functions to prevent them from generating outputs that require grad (see the sketch below).
- Add a check to make sure we don't add such functions to the codegen by mistake.
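
A quick check of the fixed behavior:

```
import torch

x = torch.randn(4, requires_grad=True)

# Integer-valued outputs cannot be differentiated through, so they
# should never require grad:
assert not torch.argmax(x).requires_grad
assert not torch.argsort(x).requires_grad
```
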
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37789

Differential Revision: D21389201

Pulled By: albanD

fbshipit-source-id: 6a7617e389e893f6f813d50f02700d32300b1386
2020-05-07 15:08:35 -07:00
facc5e0cc4 Make profiler thread local (#36291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36291

Move profiler state to be a thread-local property and reuse the
existing thread-local propagation mechanism to ensure correct
profiling of async tasks. This also makes push/pop callbacks
thread-safe and easier to use in e.g. the distributed profiler.

Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit

./build/bin/test_jit
python test/test_autograd.py
python test/test_jit.py

Differential Revision: D20938501

Pulled By: ilia-cher

fbshipit-source-id: c0c6c3eddcfea8fc7c14229534b7246a0ad25845
2020-05-07 14:52:49 -07:00
2ef4010593 Propagate TLS callbacks with ThreadLocalState (#37745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37745

This PR makes it possible to set TLS callbacks and use them
transparently, not only in the main thread but also in any async
tasks.

Test Plan: Imported from OSS

Differential Revision: D21374873

Pulled By: ilia-cher

fbshipit-source-id: 3be2e121673b32d7694e17e794f3b474826dffe9
2020-05-07 14:52:44 -07:00
2d708cefcc Move RecordFunction into ATen (#37548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37548

Moving RecordFunction from torch::autograd::profiler into at namespace

Test Plan:
CI

Imported from OSS

Differential Revision: D21315852

fbshipit-source-id: 4a4dbabf116c162f9aef0da8606590ec3f3847aa
2020-05-07 14:52:39 -07:00
c24c5f9684 Make RecordFunction callbacks thread local and modernize interface (#37491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37491

This PR modernizes RecordFunction API and adds thread local callbacks
in addition to the global ones

Changes:
 - support for TLS callbacks; this is going to be the foundation of the profiler and other tools
 - modernize the interface around a simple set of functions, (add|remove|has|clear)(Global|ThreadLocal)(Callback), and add RecordFunctionCallback to easily construct the callbacks to be passed
 - we also add `.setShouldRun` to the callback interface to support cases where simple uniform sampling is not enough
 - to properly support add/remove, introduce the idea of a callback handle returned by add
 - the internal implementation still uses SmallVector to store intermediate state (as before) - in this case a vector of handles of the callbacks that were picked to run
 - to speed up the runtime we keep these vectors sorted; this way we can quickly enumerate the callbacks that need to be run
 - added tests for the new functionality

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py
develop install
./build/bin/test_jit
CI

record_function_benchmark: https://gist.github.com/ilia-cher/f1e094dae47fe23e55e7672ac4dcda2f

Imported from OSS

Differential Revision: D21300448

fbshipit-source-id: 6d55c26dbf20b33d35c3f1604dcc07bb063c8c43
2020-05-07 14:51:02 -07:00
dc25190833 Move resize / zero logic for _thnn_conv_depthwise2d from codegen to native code. (#37957)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37957

Test Plan: Imported from OSS

Differential Revision: D21433212

Pulled By: gchanan

fbshipit-source-id: fb431d5cf06afe2bb87fa2d73e15046f9a8d044d
2020-05-07 14:27:43 -07:00
ed4e7cec03 Move _thnn_conv2d resize and zero code from codegen to native code. (#37956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37956

This is basically just doing what the CPU code already does, but keeping the kernel in THC, unlike on CPU where it has already moved to native.

Test Plan: Imported from OSS

Differential Revision: D21433211

Pulled By: gchanan

fbshipit-source-id: b7440aa50905b8c94b087eaa95f5b20a27b19d3a
2020-05-07 14:26:13 -07:00
99349393ba Fixed gradcheck for complex (#37836)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37836

Test Plan: Imported from OSS

Differential Revision: D21456881

Pulled By: anjali411

fbshipit-source-id: 9ccd130f7f23fc7b47c1c0a1f6ebfa0df0332c06
2020-05-07 14:13:03 -07:00
8a8b7a16be Remove unpacked int8 blob after constructing the packed blob to save memory (#37973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37973

Fix the unexpected memory usage issue in model QRT for the OC model.

Test Plan:
```
buck test mode/opt caffe2/caffe2/quantization/server:fully_connected_dnnlowp_op_test
```
```
buck test  mode/opt caffe2/caffe2/fb/fbgemm:int8_serializer_test
```

Reviewed By: hx89

Differential Revision: D21422257

fbshipit-source-id: cc586123b8bfe41c85c6f2f7e493954845ad18a2
2020-05-07 14:05:30 -07:00
f0f587366c [Tensorpipe Agent] Implementing getMetrics with currently available metrics (#37980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37980

This implements `TensorPipeAgent::getMetrics` with the metrics currently available. Will add other metrics such as Client/Server Active Calls once timeouts are implemented.
ghstack-source-id: 103624005

Test Plan: CI

Differential Revision: D21439184

fbshipit-source-id: 8a15df58cc23cdf954e604c0f806877ba111e0a6
2020-05-07 14:02:22 -07:00
5d21a9cfc7 [Tensorpipe Agent] Network Data Profiling (#37852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37852

This tracks network-related metrics in the Tensorpipe RPC Agent, including the number of bytes sent and received on each node, the number of errors, the number of successful calls, etc.
ghstack-source-id: 103681018

Test Plan: CI

Differential Revision: D21340499

fbshipit-source-id: 5682a3351a6394de92a7430869b24fc56c08d793
2020-05-07 14:02:16 -07:00
25359f7392 [Tensorpipe Agent] Implement Global Interpreter Lock Wait Time Metric (#37851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37851

Tracks `GilWaitTime` metric in the Tensorpipe RPC Agent.
ghstack-source-id: 103528374

Test Plan: CI

Differential Revision: D21339527

fbshipit-source-id: 7acc2cf304e1172de21b0e459f2c97430cd22834
2020-05-07 14:02:11 -07:00
b452fef583 [Tensorpipe Agent] Base Structs for Tracking RPC Metrics (#37850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37850

Adding the base structs for tracking time-series metrics in the Tensorpipe RPC Agent
ghstack-source-id: 103528373

Test Plan: CI

Differential Revision: D21339520

fbshipit-source-id: 8334044cdded44a940800c1d1f14d07ffab1a7e2
2020-05-07 14:00:26 -07:00
a44824c9ed [TensorExpr] Allow to enable/disable fallback mechanism thru an envvar PYTORCH_TENSOREXPR_FALLBACK. (#37971)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37971

Test Plan: Imported from OSS

Reviewed By: protonu

Differential Revision: D21444831

Pulled By: ZolotukhinM

fbshipit-source-id: c75f58772a4730e8f40f05491f9e5afa4aa3ed30
2020-05-07 12:20:31 -07:00
067f08c148 [TensorExpr] Move controlling knob out of the TE fuser pass. (#37970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37970

This change makes the pass friendlier for users who try to invoke it
directly.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D21444832

Pulled By: ZolotukhinM

fbshipit-source-id: 8be4b5028b3bd84082874e16f38a70b245af5d19
2020-05-07 12:18:31 -07:00
3066d3ac1c Remove overly strict assertion for type demotion of scalars. (#38001)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38001

Reviewed By: ZolotukhinM

Differential Revision: D21445921

Pulled By: resistor

fbshipit-source-id: efe441eea5c2996d919c5c2621b13a379e68accb
2020-05-07 11:51:17 -07:00
7bf9d983ea [quant] Release qnnpack original weights for conv/linear (#37595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37595

QNNPACK currently does not support an unpack function, so we store the original weights in the packed structure and return them directly to the user when unpack is called.
However, for memory-constrained environments (like mobile), storing these extra weights in memory is expensive. We need to release them after packing on mobile to free up the memory. As a side effect, the user cannot call unpack on mobile once the model has been run.

The change is gated by C10_MOBILE which is enabled for mobile builds.

The change saves 36MB on device for Speech Model.

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21365495

fbshipit-source-id: 66465ea0b4a10d44187d150edfb90d989e872b65
2020-05-07 11:46:32 -07:00
dd64d26d74 Make speed_benchmark_torch report latency in us (#37953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37953

Earlier it claimed to report microseconds (us) but actually reported milliseconds (ms).

Test Plan: buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --devices s9u --remote --framework pytorch --logger_level info --job_queue aibench_interactive --platform android/full_jit

Reviewed By: xcheng16

Differential Revision: D21349612

fbshipit-source-id: b97b6216eb0264123ff2c7852a0678b2008b0bf1
2020-05-07 11:08:14 -07:00
85fccba224 Message Delay fix for test_check_failed_messages (#37978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37978

Faulty PGA tests now have messages delayed by default. Tests that were written prior to this addition should explicitly turn this off, since they are not designed to work reliably with message delays.
ghstack-source-id: 103622888

Test Plan: Stress-running this test with TSAN. Also added a sanity check in the verify_backend_options test that verifies the default value of `messages_to_delay`.

Differential Revision: D21440043

fbshipit-source-id: 78151f07a3294c3dfcfaeacd6a5e5b77a0f34da1
2020-05-07 10:51:59 -07:00
305444a0bd Update miniconda repository, be specific about cudatoolkit (#37186)
Summary:
The Miniconda repo has moved from continuum.io to anaconda.com.

Also, we should be specific about the cudatoolkit version so that it
installs the right CUDA version.

Resolves https://github.com/pytorch/pytorch/issues/37047

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37186

Differential Revision: D21443147

Pulled By: seemethere

fbshipit-source-id: 856718822bdd3ce51bbc6e59b0609fe6af77bd79
2020-05-07 09:58:18 -07:00
2b41b9bceb [BE] Add @skipIfNoFBGEMM decorator (Reland) (#37894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37894

Differential Revision: D21449993

Pulled By: malfet

fbshipit-source-id: d9d355d360384cbb158f62b40dc885527f22ee05
2020-05-07 09:43:53 -07:00
1667aa6451 [CUDA_FUSER] Expand operation support for cuda fuser (#37849)
Summary:
This PR adds more supported operations to the CUDA fuser. We are covering the major point-wise operations supported in the legacy fuser.

In an attempt to adapt to the legacy executor:
1. added a naive shape propagation pass on PyTorch JIT IR;
2. a small refactor of graph partitioning;
3. fallback interpreter execution of fusion groups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37849

Reviewed By: yf225

Differential Revision: D21444320

Pulled By: soumith

fbshipit-source-id: 712e18ab8497f8d58a07e6f8d200cdab52cf0d74
2020-05-07 09:21:09 -07:00
ffed9dca42 [TensorPipe] Update submodule (#38013)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38013

Reviewed By: mrshenli

Differential Revision: D21450029

Pulled By: lw

fbshipit-source-id: 9e449af76d9e232a8c800981db421c4f390c49b2
2020-05-07 08:58:02 -07:00
b2cc9928dd Move resize logic for bmm from codegen to native code. (#37955)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37955

Test Plan: Imported from OSS

Differential Revision: D21433213

Pulled By: gchanan

fbshipit-source-id: 421c566471279b53348bc77e738af13a1f3e1f9e
2020-05-07 08:25:46 -07:00
ee1ddcef8d Acquire GIL when constructing/destructing ConcretePyObjectHolder (#37870)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37870

Test Plan: Imported from OSS

Differential Revision: D21410785

fbshipit-source-id: 374d5f40fbdfec98262aa4c84ec4ccdc40fb2ac1
2020-05-07 07:37:39 -07:00
594b33ea10 Add support for non-persistent buffers. (#37191)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/18056
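
A sketch of the new option (assuming the `persistent` keyword argument to `register_buffer` that this PR introduces):

```
import torch
import torch.nn as nn

class Counter(nn.Module):
    def __init__(self):
        super().__init__()
        # Moves with .to()/.cuda() like any buffer, but is excluded from
        # (and not expected in) the state_dict:
        self.register_buffer("steps", torch.zeros(1), persistent=False)

assert "steps" not in Counter().state_dict()
```
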
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37191

Differential Revision: D21428373

Pulled By: albanD

fbshipit-source-id: a7d367bafb95137e1bc380178b82b08eff5d5a5a
2020-05-07 06:52:31 -07:00
46ed3349f3 Add --check-untyped-defs to mypy.ini and test suite (#37594)
Summary:
Also move the ignores for imports to the bottom of `mypy.ini`; those are much less interesting. Start with the stuff people want to work on.

The second commit tests the instructions: remove an ignore, fix the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37594

Differential Revision: D21434858

Pulled By: ezyang

fbshipit-source-id: 4f1a6868cdb4cb59d072bcf105f48c3a5ba3ff98
2020-05-07 06:36:01 -07:00
30fc58cfcc Migrate CUDA where, tril, triu to c10::complex (#37896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37896

Test Plan: Imported from OSS

Differential Revision: D21442150

Pulled By: anjali411

fbshipit-source-id: b80ff801572a61f76dd25f94726a2a6334a89f3b
2020-05-07 06:08:37 -07:00
7be9796cc4 [ONNX] Support clamp_min and clamp_max (#37872)
Summary:
clamp_min is used in `torch.nn.functional.normalize`. Update symbolic_opset11 to support it with the updated Clip in ONNX opset 11.
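
For reference, where clamp_min shows up (a sketch of normalize's core computation, assuming the default eps of 1e-12):

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 3)

# F.normalize divides by the norm clamped away from zero, roughly:
manual = x / x.norm(p=2, dim=1, keepdim=True).clamp_min(1e-12)
assert torch.allclose(manual, F.normalize(x, p=2, dim=1))
```
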
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37872

Reviewed By: hl475

Differential Revision: D21440450

Pulled By: houseroad

fbshipit-source-id: a59cbec3f4d00c3f6654da6a747fbfca59d618f1
2020-05-07 04:39:46 -07:00
bc09478a60 [TensorPipe] Use the new multi-payload message API (#37919)
Summary:
In D21209901 TensorPipe added support for a vector of payloads inside each message, instead of a single one, so that users with multiple payloads can send them separately as they are, instead of having to copy them into a new block of contiguous memory. The PyTorch agent is still using the old API, which is preventing us from deleting it. This change has no effect on the over-the-wire format and thus none on performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37919

ghstack-source-id: 103572164

Test Plan:
On both workers
```
import os
import torch
import torch.distributed.rpc as rpc
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "8765"
```
On worker 0
```
rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(worker_name_to_id={"foo": 0, "bar": 0}))
```
On worker 1
```
rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(worker_name_to_id={"foo": 0, "bar": 0}))
```
On worker 0
```
In [15]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[15]:
tensor([[3., 3.],
        [3., 3.]])

In [16]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[16]: 3
```

Differential Revision: D21425536

fbshipit-source-id: a0ec2be825556b39aff018a2834baf815a6d8fa5
2020-05-07 02:52:30 -07:00
978ad16290 [TensorPipe] Allow passing args to agent options constructor (#37918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37918

ghstack-source-id: 103569096

Test Plan: Tested top of stack

Reviewed By: jiayisuse

Differential Revision: D21425537

fbshipit-source-id: 2e78d700ea774944c7fd8b22e152d8e459dd422a
2020-05-07 02:50:47 -07:00
4e93844ab1 remove deprecation warning on get_contiguous_memory_format (#37963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37963

This function is still widely used in the codebase, so we don't want
to add noise to builds with a bunch of warnings. The comment + macro
already seem like pretty good indications that this functionality is
considered legacy.

Test Plan: Imported from OSS

Differential Revision: D21434447

Pulled By: suo

fbshipit-source-id: 08162ed6502894ea5d3ccb92dfa0183232cc2ab5
2020-05-07 02:06:22 -07:00
65260d48c8 Fix splitWithTail to insert the tail immediately after the outer loop. (#37941)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37941

Differential Revision: D21429733

Pulled By: resistor

fbshipit-source-id: 12094d990c11da8b44f32a52aa5e50b3f3575145
2020-05-07 00:05:23 -07:00
9143d7fb68 [Fakelowp] Open source fake fp16 FC ops (#37923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37923

ATT. Previously we missed this one.

Test Plan: unittests

Reviewed By: hyuen

Differential Revision: D21426190

fbshipit-source-id: de85892a50a4b4820386e0f0d6adc34d12b33788
2020-05-06 23:53:27 -07:00
76c964dfb0 Reland [quant][tests] Enable tests to run on all qengine backends (#37943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37943

Refactor tests to use supported_qengines

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21435514

fbshipit-source-id: 8004ef2535e1cc65036f331c00af27ded1c04a6b
2020-05-06 22:38:50 -07:00
122587dcb4 [ONNX] Improve error checking for large model export (#37798)
Summary:
* Add an error message when the ONNX model file path is not a string.
* Add an error message when the model size exceeds 2GB and large model export is not turned on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37798

Reviewed By: hl475

Differential Revision: D21440571

Pulled By: houseroad

fbshipit-source-id: 054aaa25ab0cffc229f9b487a2c160623c89b741
2020-05-06 22:35:00 -07:00
385f7e59a7 Report test stats (#37803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37803

Differential Revision: D21445128

Pulled By: malfet

fbshipit-source-id: 9d73b3fa32ece56309cf5ef08bd9e7fc64e0a69e
2020-05-06 22:26:07 -07:00
952e0f00a4 Skip c2_ref_tests on network failures (#37972)
Summary:
Skip the tests if the network is inaccessible and the model cannot be downloaded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37972

Differential Revision: D21441996

Pulled By: malfet

fbshipit-source-id: 5ce59764584974aee9195572338ada1fa0351a75
2020-05-06 22:19:28 -07:00
72e5b7ae5b Add option to run python unittests in parallel (#37180)
Summary:
So far the results look quite promising: test_nn is purely sequential and can be accelerated 3x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37180

Differential Revision: D21437871

Pulled By: malfet

fbshipit-source-id: 8679a8af355f839f2c9dae3bf36d2e102af05425
2020-05-06 22:14:11 -07:00
681c6fb60f Move complex utilities out of Half.h (#37676)
Summary:
There is no reason to put complex utilities in the half header.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37676

Differential Revision: D21440270

Pulled By: anjali411

fbshipit-source-id: bbed5fcb5be33f6a4aedcc9932595d43d97672f6
2020-05-06 19:46:05 -07:00
634282112b updated create input and add test methods and added a whitelist for complex (#37835)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37835

Test Plan: Imported from OSS

Differential Revision: D21434429

Pulled By: anjali411

fbshipit-source-id: 2590dfbae3e60c1a1019c96fe1c0b177ae088ccf
2020-05-06 19:40:25 -07:00
14fc83ebc7 Add missing c10::complex::value_type (#37677)
Summary:
Such a member type exists in the standard; see https://en.cppreference.com/w/cpp/numeric/complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37677

Differential Revision: D21410197

Pulled By: anjali411

fbshipit-source-id: 749be1d71190e4afc13513b396da47f33cb990c7
2020-05-06 19:36:20 -07:00
09bedec29e move quantization normalization layers to aten/src/ATen/native/quantized/cpu/ (#37352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37352

Implementing cleanup requested on #36835.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_instance_norm
python test/test_quantization.py TestQuantizedOps.test_group_norm
python test/test_quantization.py TestQuantizedOps.test_qlayer_norm
```

Imported from OSS

Differential Revision: D21261139

fbshipit-source-id: bebcad62a21a082152281a50defaa82aa769935a
2020-05-06 19:01:39 -07:00
4fa049c525 add quantized instancenorm operator (#36847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847

Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.

Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0)

Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```

Imported from OSS

Differential Revision: D21107925

fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
2020-05-06 19:01:33 -07:00
b837d5d418 add quantized groupnorm operator (#36835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835

Adds a quantized groupnorm operator.  We reuse most of the layernorm
kernel, modifying it to be able to perform channel-wise scaling.

Benchmark results: the quantized layer is between 6x and 15x faster
than the floating point version, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9)

Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```

Numerics are nearly equivalent, with the only difference documented
in the test case.  The difference is of the same kind as with quantized
layernorm.  Making numerics fully equivalent is possible but would sacrifice
speed.
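
For context, a minimal sketch of the reference numerics the quantized op is compared against (my own illustration; the scale/zero_point values are arbitrary assumptions):

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
# Reference path: dequantize, run float group norm, requantize the result.
ref = F.group_norm(qx.dequantize(), num_groups=2)
qref = torch.quantize_per_tensor(ref, scale=0.05, zero_point=64, dtype=torch.quint8)
```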

Imported from OSS

Differential Revision: D21107926

fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
2020-05-06 19:01:26 -07:00
288dd33770 quant: remove hypothesis and int32 from layernorm test (#37947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37947

The current test is flaky; this removes two potential causes of flakiness.

Test Plan:
CI

Imported from OSS

Differential Revision: D21434861

fbshipit-source-id: 82ea5762f3bb07a12052cde29729d73e95da8ddd
2020-05-06 18:59:54 -07:00
675e77e88a add docker image build ubuntu16.04-cuda9.2-cudnn7-gcc5.4-py3.6 (#37610)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37610

Test Plan: Imported from OSS

Differential Revision: D21336643

Pulled By: glaringlee

fbshipit-source-id: 92340a9e83c79c199f4a739a24857be08ca28e19
2020-05-06 18:24:05 -07:00
35693e9b4b Give at::cuda::blas::gemv<at::Half> parity with <float> and <double>. Nature is healing. (#37569)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37157 on my machine.

This was annoying to track down.  The essence is that cublas expects column major inputs and PyTorch tensors are usually row major.  Cublas lets you request that it act on transposed data, and the erroring `gemv` calls in https://github.com/pytorch/pytorch/issues/37157 make that request.  The problem is, [cublasSgemv and cublasDgemv](https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemv) (called by [`gemv<float>`](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L318)) and `gemv<double>`) regard their `m, n` argument values as _pre_-transpose sizes, while [cublasGemmEx](https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx) (called by `gemv<at::Half>`, see [here](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L342)) and [here](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L229))) regards its `m, k` argument values as _post_-transpose sizes.  This is inconsistent.  It turns out the `gemv<float>/<double>` calls are configured correctly and the `gemv<at::Half>` calls aren't.
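
The user-visible trigger is ordinary matrix-vector multiplication in half precision. A minimal repro sketch (my own, not the author's test; assumes a CUDA device is available):

```
import torch

# torch.mv on a half tensor dispatches to at::cuda::blas::gemv<at::Half>,
# the code path being fixed here.
if torch.cuda.is_available():
    m = torch.randn(5, 3, device="cuda", dtype=torch.half)  # row-major input
    v = torch.randn(3, device="cuda", dtype=torch.half)
    out = torch.mv(m, v)
    ref = torch.mv(m.float(), v.float())
    print(torch.allclose(out.float(), ref, atol=1e-2))  # should be True
```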

Strikethrough text below is no longer accurate; ngimel suggested a better way to handle gemv->gemm forwarding.  [Comments in code](https://github.com/pytorch/pytorch/pull/37569/files#diff-686aa86335f96b4ecb9b37f562feed12R323-R348) provide an up-to-date explanation.

Keeping the out-of-date strikethrough text because I don't have the heart to delete it all and because it captures an intermediate state of my brain that will help orient me if I ever have to fix this again.

~~To convince myself this PR keeps `at::cuda::blas::gemv`'s external API consistent across dtypes, I need to think through what happens when a PyTorch tensor input of size `(a,b)` multiplies a vector of size `(b,)` for 4 cases:~~

### ~~1. input is row-major (needs cublas internal transpose)~~
#### ~~1a. input is float or double~~
~~`gemv<float>/<double>` call `cublasS/Dgemv`, forwarding `trans`,** `m`, and `n` directly.~~

~~`cublasS/Dgemv` expects "a m × n matrix stored in column-major format" (so m is the input's fast dim).  Input has size `(a, b)` in row-major format.  We can reinterpret it as a column-major matrix with size `(b, a)` without any memory movement.  So the gemv call should supply `m=b`, `n=a`.  However, we're not trying to multiply a matrix `(b, a)` x a vector `(b,)`; we're trying to sum across `b` for matrix and vector.  So we also request that cublas transpose the matrix internally by supplying `trans='t'` to `blas::gemv`, which becomes `trans=CUBLAS_OP_T` to the `cublasS/Dgemv`.~~

~~As long as the code calling `blas::gemv` thinks carefully and passes `trans='t'`, `m=b`, `n=a`, cublas carries out `(a, b) x (b,)` and all is well.~~

#### ~~1b. input is half or bfloat16~~
~~`blas::gemv<at::Half>` takes a different code path, calling `gemm<at::Half>` which calls `cublasGemmEx`.  The job of this PR is to make sure the exterior `blas::gemv` caller's carefully thought-out argument choices (`trans='t'`, `m=b`, `n=a`) remain correct.~~

~~`cublasGemmEx` takes args `transa, transb, m, n, k, ....others we don't care about` and carries out~~
```
C = α·op(A)·op(B) + β·C

where α and β are scalars, and A, B and C are matrices stored in
column-major format with dimensions op(A): m × k, op(B): k × n, and
C: m × n. Also, for matrix A:

op(A) = A    if transa == CUBLAS_OP_N
op(A) = A^T  if transa == CUBLAS_OP_T ...
```
~~`gemv<at::Half>` hacks a gemv by calling gemm such that the raw gemm's `m` is the output dim, `k` is the summed dim, and `n=1`.  Reasonable, as long as we get the values right, given that we also need to transpose the input.~~

~~To conform with cublas docs we interpret input as column-major with size `(b, a)`.  As for the `<float>/<double>` gemv we want cublas to carry out input (interpreted as column major), internally transposed, times vector of size `(b,)`.  In other words we want cublas to apply `op(A) x B`, where op is transpose and `A` is input interpreted as column major.  Docs define `m` and `k` by saying `op(A)` has dims `m x k` **(`m` and `k` are _post_-`op` sizes)**.  `A` was `(b, a)`, `op(A)` is `(a, b)`, so the correct thing is to supply `m=a`, `k=b` to the underlying gemm.  **For the `<float>/<double>` gemv, we passed `m=b`, not `m=a`, to the raw `cublasS/Dgemv`.**~~

~~The exterior `blas::gemv` must have been called with `trans='t'`, `m=b`, `n=a` (as required by the `<float>/<double>` versions).  So when gemv is about to call gemm, **we [swap](https://github.com/pytorch/pytorch/pull/37569/files#diff-686aa86335f96b4ecb9b37f562feed12R330) the local values of `m` and `n` so that `m=a`, `n=b`,** then put `m (=a)` in the gemm's `m` spot, 1 in the gemm's `n` spot, and `n (=b)` in the gemm's `k` spot.  All is well (we made the right gemm call after ingesting the same arg values as `blas::gemv<float>/<double>`).~~

### ~~2. input is column-major (doesn't need cublas transpose)~~
#### ~~2a. input is float or double~~
~~input is `(a,b)`, already column-major with strides `(1,a)`.  Code calling `blas::gemv` supplies `trans='n'` (which becomes `CUBLAS_OP_N`, no internal transpose), `m=a`, `n=b`.~~

#### ~~2b. input is half or bfloat16~~
~~`blas::gemv` should pass `transa='n'`, `m=a`, `n=1`, `k=b` to the underlying gemm. The exterior `blas::gemv` must have been called with `trans='t'`, `m=a`, `n=b` (as required by the `<float>/<double>` versions). So **in this case we _don't_ swap `blas::gemv`'s local values of `m` and `n`.** We directly put `m (=a)` in the gemm's `m` spot, 1 in the gemm's `n` spot, and `n (=b)` in the gemm's `k` spot. All is well (we made the right gemm call after ingesting the same arg values as `blas::gemv<float>/<double>`).~~

~~** `trans` is a string `t` or `n` in the `at::cuda::blas::gemv` API, which gets [converted](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L314)) to a corresponding cublas enum value `CUBLAS_OP_T` (do transpose internally) or `CUBLAS_OP_N` (don't transpose internally) just before the raw cublas call.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37569

Differential Revision: D21405955

Pulled By: ngimel

fbshipit-source-id: e831414bbf54860fb7a4dd8d5666ef8081acd3ee
2020-05-06 18:19:30 -07:00
28ed04c620 [JIT] remove list_with_default op (#37886)
Summary:
We can implement this as a builtin instead of as a registered op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37886

Differential Revision: D21414329

Pulled By: eellison

fbshipit-source-id: 6e130fa83fbf7ba4d4601f509cb169a2fa804108
2020-05-06 17:32:11 -07:00
f538cd627a Install HugePagesArena to optimize pytorch prediction performance (#37640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640

Enable oversize arena to reduce memory fragmentation. Memory requests with large sizes (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena separate from the existing huge page arena.

Two additional parameters are introduced to configure the 2-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms

In the current jemalloc implementation, oversized allocations are immediately purged regardless of whether they are put in an arena. Therefore we need to extend the decay time indefinitely; currently we set the default for caffe2_muzzy_decay_ms to -1.

We now enable the arena allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag in c10::SetAllocator, and only higher priority allocators can overwrite existing ones.
ghstack-source-id: 103276877

Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test

Benchmarking known CV model that benefits from page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```

By adjusting `dirty_decay_ms` and `muzzy_decay_ms`, we obtain the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL

From the figures above we can see that performance does not change much until the dirty decay time is indefinite (set to -1). Setting either the muzzy decay or the dirty decay time to -1 reaches the best performance, regardless of which one it is. Even setting the decay time to a very long value (100s, which is longer than the run) does not change the performance by much.

## Observe performance difference in production with a variety of models (WIP)

Reviewed By: dzhulgakov

Differential Revision: D21258581

fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af
2020-05-06 17:27:10 -07:00
3cc5062544 Update bazel to 3.1.0 (#37951)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37951

Differential Revision: D21439260

Pulled By: malfet

fbshipit-source-id: 77bcb5a28a29482f6e44c01e3dafd24d24ee7ec3
2020-05-06 17:00:38 -07:00
56fc347e49 [quant][fix] A typo in quantized::conv2d_relu (#37964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37964

I thought it was flakiness that made the conv2d_relu test fail, but it turns out to
be a typo in the implementation.
Also re-enabled the `use_fused` option in `test_conv2d_api`.

Test Plan:
.

Imported from OSS

Differential Revision: D21434776

fbshipit-source-id: 7c24c13cde0a96807e8dfbd1deabf47e8280fdb7
2020-05-06 16:38:26 -07:00
f29f96d47b Port existing zero_dim_dispatch optimizations from codegen and remove codegen capability. (#37615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37615

We probably missed a lot of these when we ported things from TH, but it's also probably not a huge deal.  There is only one left with fmod.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D21338030

Pulled By: gchanan

fbshipit-source-id: c133b4e37df87a53797939e9f757cea9446834e8
2020-05-06 15:59:42 -07:00
f5b3125af7 [JIT] Peephole optimize list ops (#37612)
Summary:
Peephole optimize  `len(li)` and `li[index]` patterns.
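
As a hedged sketch of the two patterns (illustrative only, not the actual test graphs):

```
import torch

@torch.jit.script
def pad_amount() -> int:
    pad = [2, 2]          # statically known list
    if len(pad) == 2:     # len(li) folds to a constant
        return pad[0]     # li[index] folds to a constant as well
    return 0
```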

This changes the Profiled Graph IR for the following tests:
```
(Test Name, Num ifs loops, Num non-tensor nodes)
Before:
('test_nn_Conv1d_reflect_stride2_pad2', 3, 14)
('test_nn_Conv2d_reflect_stride2_pad2', 3, 14)
('test_nn_Conv1d_circular_stride2_pad2', 5, 31)
('test_nn_Conv2d_circular_stride2_pad2', 5, 31)
('test_nn_Conv3d_circular_stride2_pad2', 5, 31)
('test_nn_Conv1d_replicate_stride2_pad2', 3, 14)
('test_nn_Conv2d_replicate_stride2_pad2', 3, 14)
('test_nn_Conv3d_replicate_stride2_pad2', 3, 14)
After
('test_nn_Conv1d_reflect_stride2_pad2', 0, 2)
('test_nn_Conv2d_reflect_stride2_pad2', 0, 2)
('test_nn_Conv1d_circular_stride2_pad2', 0, 4)
('test_nn_Conv2d_circular_stride2_pad2', 0, 7)
('test_nn_Conv3d_circular_stride2_pad2', 0, 10)
('test_nn_Conv1d_replicate_stride2_pad2', 0, 2)
('test_nn_Conv2d_replicate_stride2_pad2', 0, 2)
('test_nn_Conv3d_replicate_stride2_pad2', 0, 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37612

Differential Revision: D21352676

Pulled By: eellison

fbshipit-source-id: f8a0e7653b7a6a4c769f075de9b3044242ca9336
2020-05-06 15:55:18 -07:00
bf970bce21 Migrate some CUDA arithmetic kernels to c10::complex (#37878)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37878

Test Plan: Imported from OSS

Differential Revision: D21426621

Pulled By: anjali411

fbshipit-source-id: 6cdf0ee7320e5c4c2864331b1eaff4201d74ccf7
2020-05-06 15:51:15 -07:00
4bbf889bcf [jit][api][refactor] remove redundant deepcopy implementation (#37538)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37538

Test Plan:
.

Imported from OSS

Differential Revision: D21431011

fbshipit-source-id: 9dedccb19c7d43999a756e3f5076846527e2f6ca
2020-05-06 15:41:33 -07:00
cd0724f9f1 Do not std::move returned value (#37891)
Summary:
Using `std::move` on a returned value prevents the compiler from using copy elision and triggers a `redundant move in return statement` warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37891

Differential Revision: D21417998

Pulled By: malfet

fbshipit-source-id: 4008a6442cee3fe710c2da252b1bde7b4293b63f
2020-05-06 15:38:05 -07:00
728189588e [reland][quant][graphmode] Support a new category of ops in graph mode quantization (#37936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37936

Previously we classified ops like average pool into the category that doesn't require observation, and
the quantization of these ops was done by swapping them with dequantize ops: https://github.com/pytorch/pytorch/pull/33481
However, that swap happens in finalize, which makes finalize a numerics-changing pass: although average pool
doesn't require observation, quantized average pool = dequant + float32 average pool + quant,
so swapping average pool with dequantize changes the numerics. This is not ideal, since we want to
restrict the scope of numerics-changing passes.

This PR implements support for that. We'll classify ops like average pool into a new category and get quantized average pool through fusion, like we did for other quantized ops. The numerics-changing pass will then only happen in the insert-quant-dequant pass, so the model will have the same numerics before and after finalize. With the new category, the debug-only option (the model before finalize) for quantize_script will actually produce a model that's numerically consistent with the finalized model.
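
As a sketch of that identity (my own illustration with arbitrary scale/zero_point; this shows the numerics, not the pass itself):

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 4, 4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
# quantized average pool == dequant -> float32 average pool -> quant
out = torch.quantize_per_tensor(
    F.avg_pool2d(qx.dequantize(), kernel_size=2),
    scale=0.1, zero_point=0, dtype=torch.quint8)
```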

Test Plan: python test/test_quantization.py TestQuantizeScriptJitPasses

Differential Revision: D21432871

Pulled By: jerryzh168

fbshipit-source-id: 4926890441e39af4e459376038563c3882cc4c46
2020-05-06 15:36:29 -07:00
ec9342521b [TensorExpr] Support Bool dtype in Or, Xor, And ops and in TensorExprKernel::bindInput. (#37938)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37938

Test Plan: Imported from OSS

Differential Revision: D21428552

Pulled By: ZolotukhinM

fbshipit-source-id: f9840b2f090150da01b172a31618ea261da19ff4
2020-05-06 15:28:23 -07:00
a3042ca89d [JIT] Rewrite unaliased if output mutation (#37694)
Summary:
In a case like the one below, if x0 and x1 are both unaliased and only have a single use, then we can rewrite the mutation to x2 without breaking observable semantics. This PR makes torchvision.models.alexnet functionalizable.
```
if cond:
    x0 = op()
else:
    x1 = op()
x2.add_(1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37694

Differential Revision: D21428275

Pulled By: eellison

fbshipit-source-id: 1e2a39a8fb3819f1f225b7c345e986b3a3db253f
2020-05-06 15:26:31 -07:00
b53e6bfd49 [jit] normalize getMethod (#37472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37472

Our convention is for `findX` to return an optional version and `getX`
to assert that the X is there. Fix up `getMethod` to be consistent with
this convention.

Test Plan: Imported from OSS

Differential Revision: D21297543

Pulled By: suo

fbshipit-source-id: b40f56231cc8183e61bbb01fe5c0c113bcb6464d
2020-05-06 15:22:25 -07:00
28ac5cdc91 fix profiling test (#37961)
Summary:
this is failing in the profiling_executor job
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37961

Differential Revision: D21434341

Pulled By: eellison

fbshipit-source-id: b34f94b1595ef6f6edee76cd200f951a2ef21f22
2020-05-06 15:04:44 -07:00
6293f1fb49 Migrate cpu kernel for index and index_put to c10::complex (#37877)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37877

Test Plan: Imported from OSS

Differential Revision: D21426613

Pulled By: anjali411

fbshipit-source-id: 1bbdc0b0fc38df8a135c4cc29440b767b675324c
2020-05-06 14:51:40 -07:00
ae308db681 fix lilstm test in tensorexpr_te (#37913)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37913

Reviewed By: ZolotukhinM

Differential Revision: D21428329

Pulled By: Krovatkin

fbshipit-source-id: eefba49b59dc76a6efaad85f03a4c12b889b60a9
2020-05-06 14:44:28 -07:00
ab2373205f Create a desktop shortcut for restoring pytorch environment on CircleCI (#37926)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37926

Differential Revision: D21434713

Pulled By: ezyang

fbshipit-source-id: 20687c632547a287ce8ed4c0fc692e2210bb5871
2020-05-06 14:38:49 -07:00
945672bf3e cmake: improve dependencies in incremental builds (#37661)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26304

Test procedure:
With ninja:
[x] Build a clean checkout
[x] Build again. Result: Only 10 libraries are (needlessly) linked again, the extra delay on a 24-core machine is <10s.
[x] Build for the third time. Result: Virtually instantaneous, with no extra rebuilding.
[x] Modify DispatchTable.h. Build again. Result: `.cu` files are rebuilt, as well as many `.cpp` files
[x] Build for the fifth time. Result: Virtually instantaneous, with no extra rebuilding.
[x] Touch one of the `.depend` files. Build again. Result: Only 10 libraries are (needlessly) linked again, the extra delay on a 24-core machine is <10s.

Without ninja:
[x] Build a clean checkout
[x] Build again. Result: There is some unnecessary rebuilding. But it was also happening before this change.
[x] Build for the third time. Result: Virtually instantaneous, with no extra rebuilding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37661

Differential Revision: D21434624

Pulled By: ezyang

fbshipit-source-id: 379d2315486b8bb5972c184f9b8da8e00d38c338
2020-05-06 14:25:18 -07:00
4c4816ad07 [CPU] addmv for complex tensors (#37924)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37924

Test Plan: Imported from OSS

Differential Revision: D21429384

Pulled By: anjali411

fbshipit-source-id: 8b1b76ed13d2e5785a4d552aedb2e6f58d304c46
2020-05-06 14:13:05 -07:00
7a408576dd Stopgap fix to determine_target predicate (#37934)
Summary:
This makes it a proper Python package, so `ModuleFinder` will parse dependencies from this module (see https://docs.python.org/3/tutorial/modules.html).

As a result, changes to `torch/testing/_internal/common_quantization` or `test/quantization/*.py` are considered to affect `test_quantization.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37934

Test Plan: CI

Differential Revision: D21432413

Pulled By: malfet

fbshipit-source-id: acff6cee69a1dfd5535e33978f826ed1f6a70821
2020-05-06 14:05:14 -07:00
1ad46f470f [jit] __copy__ for RecursiveScriptModule (#36830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36830

Test Plan:
build/bin/test_jit

Imported from OSS

Differential Revision: D21431012

fbshipit-source-id: 13a1bf9744ec95ea59622226c8d8a8d55ec3f0b0
2020-05-06 13:55:01 -07:00
b1b6bc36a5 Enable xnnpack_integration test in CI. (#37838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37838

Test Plan: oss: python test/test_xnnpack_integration.py

Reviewed By: xcheng16

Differential Revision: D21405850

fbshipit-source-id: ba4ba06692b49315f110653d9492b2e14b618574
2020-05-06 13:53:03 -07:00
d6b51e4adf In interpolate, join short lines (#37170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37170

ghstack-source-id: 102773588

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D21209998

fbshipit-source-id: 9386e54aa85a5576678d21d443017079028f8dca
2020-05-06 13:03:45 -07:00
59f03c69ab In interpolate, give a short name to scale_factor_list (#37169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37169

This allows some cleanup of the code below by making lines shorter.
ghstack-source-id: 102773593

Test Plan: Existing tests for interpolate.

Reviewed By: kimishpatel

Differential Revision: D21209988

fbshipit-source-id: cffcdf9a580b15c4f1fa83e3f27b5a69f66bf6f7
2020-05-06 13:03:39 -07:00
4996961826 In interpolate, only call _interp_output_size in one place (#37168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37168

It looks like this was made a separate function because of the `dim` argument,
but that argument is always equal to `input.dim() - 2`.  Remove the argument
and consolidate all call sites into one.  This also means that this will be
called on paths that previously didn't call it, but all those cases throw
exceptions anyway.
ghstack-source-id: 102773596

Test Plan: Existing tests for interpolate.

Reviewed By: kimishpatel

Differential Revision: D21209993

fbshipit-source-id: 2c274a3a6900ebfdb8d60b311a4c3bd956fa7c37
2020-05-06 13:03:33 -07:00
8749aa2d55 Clean up formatting in upsample ops (#37166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37166

ghstack-source-id: 102773597

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D21210001

fbshipit-source-id: 8e65d638dea72d995d6c079ed8c0b03be0fb813c
2020-05-06 13:03:28 -07:00
78529f6de7 Whitespace cleanup (#37165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37165

ghstack-source-id: 102773591

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D21209997

fbshipit-source-id: c5eef259aade2ad66095231e139ba125e759445b
2020-05-06 13:01:56 -07:00
5edf5efd37 Migrate CPU sum, eq, and ne to c10::complex (#37876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37876

Test Plan: Imported from OSS

Differential Revision: D21426516

Pulled By: anjali411

fbshipit-source-id: 0532e5508ad65e649f3d4d8cde32ff871956c9f7
2020-05-06 12:21:36 -07:00
4e2ea6e013 [TensorExpr] Remove the Tensor argument from loopnest.reorderAxis (#37873)
Summary:
Remove the requirement for the axes provided to reorderAxis to come from a Tensor. We were using that to determine the relevant loops, but we can alternatively determine it by traversing the parents of each provided For.

resistor does this work for you?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37873

Differential Revision: D21428016

Pulled By: nickgg

fbshipit-source-id: b16b2f41cb443dfc2c6548b7980731d1e7d89a35
2020-05-06 12:02:15 -07:00
53e7d49a98 Port register_prim_ops_c10.cpp to new registration API (#37834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37834

Ported all use sites of the old registration API in register_prim_ops_c10.cpp to the new operator registration API.

Test Plan: Imported from OSS

Differential Revision: D21415700

Pulled By: MohammadMahdiJavanmard

fbshipit-source-id: 34f18757bad1642e1c485bb30c9771f7b7102230
2020-05-06 11:44:37 -07:00
0e3a05ec00 [JIT] rename enable_profiling_mode to enable_profiling_mode_for_profiling_tests (#37825)
Summary:
The existing context manager only conditionally enabled profiling mode, which was counterintuitive. As a result, when we changed the default executor, it broke internal benchmarking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37825

Differential Revision: D21404611

Pulled By: eellison

fbshipit-source-id: 306b3c333ef4eb44ab6a6e5ab4e0682e5ce312ce
2020-05-06 11:30:02 -07:00
436cd2c02d Migrate check_convert to c10::complex (#37875)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37875

Test Plan: Imported from OSS

Differential Revision: D21426480

Pulled By: anjali411

fbshipit-source-id: e9a474b4f7524aeeb6c63976ff7de9ac38ecefab
2020-05-06 11:13:12 -07:00
8434247653 modify select_equals_backward to propagate only to a single value (#36316)
Summary:
Renames `select_equals_backward` to `select_first_equal_backward` and makes sure it propagates to a single value.
Fixes [https://github.com/pytorch/pytorch/issues/35699](https://github.com/pytorch/pytorch/issues/35699).
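A minimal sketch of the behavior change (my own example, assuming a full-reduction max over tied values):

```
import torch

x = torch.ones(3, requires_grad=True)
x.max().backward()
# With this change the gradient flows to a single tied element,
# e.g. tensor([1., 0., 0.]), rather than being spread across all ties.
print(x.grad)
```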
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36316

Differential Revision: D21403848

Pulled By: albanD

fbshipit-source-id: b260cd79289162ee5733887d2afe8203945baee6
2020-05-06 10:50:24 -07:00
dd618216c5 [JIT] Support adv indexing using list. (#37848)
Summary:
We used to only support indexing through
- numbers like `x[0, 1]`
- tuple like `x[(0, 1)]`
- tensor like `x[torch.tensor([0, 1])]`

This PR adds support for indexing through a list, which is equivalent to indexing with a tensor.
- `x[[0, 1, 5]]`
- `x[[0, 1], [0, 1]]`
- `x[[[0, 1], [0, 1]], [[0, 1], [0, 1]]]`

Note: for `x[[0, 1, 5]]` we had a bug in the AST conversion code, so we used to treat it like `x[0, 1, 5]`, which means it might accidentally run and produce a wrong result (fixes https://github.com/pytorch/pytorch/issues/37286, fixes https://github.com/pytorch/pytorch/issues/18616). Now that it's fixed, we probably want to mark it as BC-breaking.
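
A hedged sketch of the newly supported form (my own example):

```
import torch

@torch.jit.script
def take(x: torch.Tensor) -> torch.Tensor:
    return x[[0, 1, 5]]  # now equivalent to x[torch.tensor([0, 1, 5])]

x = torch.arange(10.0)
assert torch.equal(take(x), x[torch.tensor([0, 1, 5])])
```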
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37848

Reviewed By: suo

Differential Revision: D21409840

Pulled By: ailzhang

fbshipit-source-id: 6f2d962885c6dc009cb384d98be1822f5ca7a189
2020-05-06 10:44:48 -07:00
f2148de92f Revert D21409626: [quant][tests] Enable tests to run on all qengine backends
Test Plan: revert-hammer

Differential Revision:
D21409626

Original commit changeset: 21b23e498f43

fbshipit-source-id: 44cb6d1087c521926c56fa4148c2eb897e03bb98
2020-05-06 10:37:41 -07:00
ec7fd0caef [docs] Fix broken links in contribution_guide.rst and governance.rst (#37820)
Summary:
Fix https://github.com/pytorch/pytorch/issues/37716

Fix three broken links in the documentation:
- [PyTorch Governance](https://pytorch.org/docs/source/community/governance.rst) in the [Contribution Guide page](https://pytorch.org/docs/master/community/contribution_guide.html#the-pytorch-contribution-process)
- [PyTorch Governance | Persons of Interest](https://pytorch.org/docs/source/community/persons_of_interest.rst) under the [Core Developer section](https://pytorch.org/docs/master/community/governance.html#core-developers)
- [PyTorch Contributor Guide](https://pytorch.org/docs/source/community/contribution_guide.rst) under the [FAQ session of the Governance Page](https://pytorch.org/docs/master/community/governance.html#faq)

The old link leads to the `.rst` source file, which does not exist on the server.

It's now fixed using the [document cross-referencing syntax](https://www.sphinx-doc.org/en/1.8/usage/restructuredtext/roles.html#cross-referencing-documents)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37820

Differential Revision: D21414579

Pulled By: mruberry

fbshipit-source-id: ecf6de9317ce93f70205cbfe97a3bdd54e635fe5
2020-05-06 10:33:33 -07:00
e729db48ca Remove requantization scale constraint. (#37683)
Summary:
Now that we landed float requantization for conv/linear, we do not need
the constraint for requant_scale < 1.
Removing that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37683

Test Plan: Quantization tests

Differential Revision: D21412536

Pulled By: kimishpatel

fbshipit-source-id: c932b5ab3aa40407e9d7f0c877e2fe7fd544f8a7
2020-05-06 10:23:08 -07:00
6f06df8193 Fix lint (#37922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37922

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21426425

Pulled By: ezyang

fbshipit-source-id: 9d0d997f608a742668f64e7529c41feb39bec24e
2020-05-06 09:29:34 -07:00
122d8215a3 [RESUBMIT] Kill broadcasting from the codegen layer. (#37907)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37907

Test Plan: Imported from OSS

Differential Revision: D21420872

Pulled By: gchanan

fbshipit-source-id: c782c0c438bcb7e764a97b446f8c3cd168e188f0
2020-05-06 08:54:47 -07:00
88c447bf71 Change DeprecationWarning to UserWarning in torch.cuda (#32142)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/27361 .

Addresses https://github.com/pytorch/pytorch/issues/32141 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32142

Differential Revision: D19404540

Pulled By: gchanan

fbshipit-source-id: f0b230a3224004286064da2b617ff471ba272f47
2020-05-06 08:28:43 -07:00
f78d02ed51 [quant][tests] Enable tests to run on all qengine backends (#37843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37843

Refactor tests to use supported_qengines

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21409626

fbshipit-source-id: 21b23e498f4359c7ea7430c86f931dd534ddfdb7
2020-05-06 07:51:29 -07:00
2f61b04514 Add Aten as dep to fakelowp and cpuinfo path to its include path (#37909)
Summary:
yinghai please review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37909

Reviewed By: hyuen

Differential Revision: D21422399

Pulled By: yinghai

fbshipit-source-id: 2dfce909fe11a12404d16286e77e81dd46dfda52
2020-05-06 06:32:13 -07:00
75c201ac32 Fix some amount of support for Bool in tensorexpr. (#37914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37914

Reviewed By: ZolotukhinM

Differential Revision: D21421402

Pulled By: resistor

fbshipit-source-id: 825391843d74fee3a23a934c859d867ef3cffde9
2020-05-06 02:04:48 -07:00
cdc56d0b6c Support c10::optional<Tensor> in custom C++ autograd function. (#37700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37700

Certain autograd functions can have optional Tensor arguments. For
this purpose it would be nice to support c10::optional<Tensor> as an argument
for C++ autograd functions.

I've added the appropriate overload to ExtractVariables to ensure this works.
For an example of how this is used, see D21272807.
ghstack-source-id: 103541789

Test Plan: waitforbuildbot

Differential Revision: D21363491

fbshipit-source-id: 0c8665e9bfe279e6b9ab84a889524fea11fa971c
2020-05-06 01:59:51 -07:00
b57b596f20 Reduction should not coalesce_dimensions when splitting for 32bit indexing (#37788)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37788

Differential Revision: D21387325

Pulled By: ngimel

fbshipit-source-id: dbd0f5a23e06d8c4cc68cd21b09b4b0221c4bba7
2020-05-05 23:44:00 -07:00
222fdd4227 Updating submodules
Summary:
GitHub commits:

8eb845b08d
40f530d566

Test Plan: n/a

Reviewed By: jurajh-fb

fbshipit-source-id: d73a0ab8a9ab28196e88b40bb31fe93bf20378ba
2020-05-05 23:36:49 -07:00
ad2305e556 Revert D21393512: [quant][graphmode] Support a new category of ops in graph mode quantization
Test Plan: revert-hammer

Differential Revision:
D21393512

Original commit changeset: 5632935fe1a7

fbshipit-source-id: 6e43897ee59924656af18a7f2c95c13bb4b48311
2020-05-05 22:51:40 -07:00
fe88806784 Back out "Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count" (#37893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37893

Original commit changeset: 50746043acf3

Test Plan: sandcastle and ossci

Reviewed By: malfet, seemethere, ngimel

Differential Revision: D21416509

fbshipit-source-id: 735ec4e61f9d36d4537f52dd2dc6267751aeb94b
2020-05-05 22:43:15 -07:00
8c91b78277 [TensorExpr] Fix the shape info check in the TE fuser pass. (#37882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37882

Previously we checked if a node's inputs and outputs have shape
info only when we tried to merge this node into an existing fusion
group, but we didn't check it for the first node in the group. This PR
fixes that. It was causing a failure on test_milstm_cuda, which is now
fixed.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D21412756

Pulled By: ZolotukhinM

fbshipit-source-id: 3ca30637ab8fe68443adb5fc03f1b8a11085a6a8
2020-05-05 22:34:59 -07:00
e3934dfae8 [ROCm] Enable bfloat16 for ops in BERT model (#37634)
Summary:
Enables the bfloat16 type for ops present in the BERT model.
Enables the relevant unit tests.

ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37634

Differential Revision: D21413957

Pulled By: ezyang

fbshipit-source-id: 19309fe46b4a2f07922bf5b32fee2066df514aeb
2020-05-05 21:24:56 -07:00
402f635bbe Enable ahead of time compilation for HIPExtensions using ninja (#37800)
Summary:
This pull request enables ahead-of-time compilation of HIP extensions with ninja by setting the appropriate compilation flags for the ROCm environment. It also enables the cuda_extensions unit test on ROCm and removes the ahead-of-time ninja extension compilation test from ROCM_BLACKLIST.

ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37800

Differential Revision: D21408148

Pulled By: soumith

fbshipit-source-id: 146f4ffb3418f3534e6ce86805d3fe9c3eae84e1
2020-05-05 20:53:35 -07:00
70f375becf [quant] ConvPackedParams with TorchBind (#35923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35923

(Note: this ignores all push blocking failures!)

Test Plan:
tbd

Imported from OSS

Differential Revision: D20957089

fbshipit-source-id: 74d8bd628ccba64e902ea6ebabc2b883924050b0
2020-05-05 20:18:36 -07:00
32b09f7ab9 Devirtualize device init calls in factory op wrappers (#37815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37815

Generated device-specific wrappers for Tensor factory ops now call
methods on `globalContext()` directly, rather than indirecting
through `globalLegacyTypeDispatch()`, which we can now delete.

Test Plan: Imported from OSS

Differential Revision: D21398294

Pulled By: bhosmer

fbshipit-source-id: b37bc67aa33bfda6f156d441df55ada40e9b814d
2020-05-05 19:56:45 -07:00
9f060d3873 [Caffe2] Increase timing threshold to 50 ms on Windows (#37892)
Summary:
Helps prevent accidental failures like the following:
```
..\caffe2\core\parallel_net_test.cc:303
The difference between ms and 350 is 41, which exceeds kTimeThreshold, where
ms evaluates to 391,
350 evaluates to 350, and
kTimeThreshold evaluates to 40.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37892

Differential Revision: D21417251

Pulled By: malfet

fbshipit-source-id: 300cff7042e466f014850cc7cc406c725d5d0c04
2020-05-05 19:45:36 -07:00
5eacc9cb57 [quant][graphmode] Support a new category of ops in graph mode quantization (#37515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37515

Previously we classified ops like average pool into the category that doesn't require observation, and
the quantization of these ops was done by swapping them with dequantize ops: https://github.com/pytorch/pytorch/pull/33481
However, that swap happens in finalize, which makes finalize a numerics-changing pass: although average pool
doesn't require observation, quantized average pool = dequant + float32 average pool + quant,
so swapping average pool with dequantize changes the numerics. This is not ideal, since we want to
restrict the scope of numerics-changing passes.

This PR implements support for that. We'll classify ops like average pool into a new category and get quantized average pool through fusion, like we did for other quantized ops. The numerics-changing pass will then only happen in the insert-quant-dequant pass, so the model will have the same numerics before and after finalize. With the new category, the debug-only option (the model before finalize) for quantize_script will actually produce a model that's numerically consistent with the finalized model.

Test Plan:
python test/test_quantization.py TestQuantizeScriptJitPasses

Imported from OSS

Differential Revision: D21393512

fbshipit-source-id: 5632935fe1a7d76382fda22903d77586a08f0898
2020-05-05 19:04:53 -07:00
480bd0ad50 Stop defining static data in Vec256 (#37767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37767

Fixes #37577

Needs tests, and maybe a lint.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21386704

Pulled By: ezyang

fbshipit-source-id: 082c69f9e1f40dc5ed7d371902a4c498f105d99f
2020-05-05 18:46:40 -07:00
96b512be07 fix msan in vec_reduce_all (#37853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37853

```
Uninitialized value was created by an allocation of 'acc_arr_next' in the stack frame of function '_ZN2at6vec25614vec_reduce_allIfZZNS_6native12_GLOBAL__N_124_vec_log_softmax_lastdimIfEEvPT_S6_llENKUlllE_clEllEUlRNS0_12_GLOBAL__N_16Vec256IfEESB_E_EES5_RKT0_NS9_IS5_EEl'
    #0 0xa961530 in float at::vec256::vec_reduce_all<float, void at::native::(anonymous namespace)::_vec_log_softmax_lastdim<float>(float*, float*, long, long)::'lambda'(long, long)::operator()(long, long) const::'lambda'(at::vec256::(anonymous namespace)::Vec256<float>&, at::vec256::(anonymous namespace)::Vec256<float>&)>(void at::native::(anonymous namespace)::_vec_log_softmax_lastdim<float>(float*, float*, long, long)::'lambda'(long, long)::operator()(long, long) const::'lambda'(at::vec256::(anonymous namespace)::Vec256<float>&, at::vec256::(anonymous namespace)::Vec256<float>&) const&, at::vec256::(anonymous namespace)::Vec256<float>, long) xplat/caffe2/aten/src/ATen/cpu/vec256/functional.h:12
```

Test Plan:
passed sanitizer locally after change,
CI green

Differential Revision: D21408120

fbshipit-source-id: b9d058cedf42b3d1d34ce05a42049d402906cd13
2020-05-05 18:25:15 -07:00
e3d1c4eaac Revert D21310335: reenable quantization test_qadd_scalar_relu test
Test Plan: revert-hammer

Differential Revision:
D21310335

Original commit changeset: 99d22e61168f

fbshipit-source-id: 081b24ef0026ffb5fbb86d0654406b46e3d752eb
2020-05-05 18:02:15 -07:00
92f750b5c7 disable clang-tidy modernize-trailing-return (#37888)
Summary:
too much noise from this warning
![image](https://user-images.githubusercontent.com/5086322/81123764-b6e15900-8ee8-11ea-8f2f-49d69ddde25d.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37888

Differential Revision: D21415338

Pulled By: Krovatkin

fbshipit-source-id: 8d6f1be11d8419fa54a18e167929100401da439a
2020-05-05 17:40:22 -07:00
0359a9b0a0 Delay loading the cuda library on Windows (#37811)
Summary:
This lets us import torch compiled with CUDA on a CPU-only machine.
Needs tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37811

Differential Revision: D21417082

Pulled By: ezyang

fbshipit-source-id: 7a521b651bca7cbe38269915bd1d1b1bb756b45b
2020-05-05 17:28:28 -07:00
91c1505e5a Move addmm broadcasting code from codegen layer to native layer. (#37613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37613

Test Plan: Imported from OSS

Differential Revision: D21337341

Pulled By: gchanan

fbshipit-source-id: 064e983e0dc4334c5eed9df1af57bd7fc29d7a81
2020-05-05 17:15:48 -07:00
6792c3ad24 Move addbmm broadcasting from the codegen layer to native layer. (#37603)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37603

Test Plan: Imported from OSS

Differential Revision: D21333923

Pulled By: gchanan

fbshipit-source-id: 6afb8f7b9931fd78064b4c759d38ffb0f4a6e293
2020-05-05 17:13:16 -07:00
b8d48d3680 Revert D21406034: [pytorch][PR] [BE] Add @skipIfNoFBGEMM decorator
Test Plan: revert-hammer

Differential Revision:
D21406034

Original commit changeset: 9583a8a726c2

fbshipit-source-id: ec891e5d00c78310b320f4901a261fc99fc5399b
2020-05-05 16:48:40 -07:00
34bf868ebc Fix weight quantization in RNNs (#35961)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35961

Weight quantization was done incorrectly for LSTMs: the statistics for all weights (across layers) were combined in the same observer. This meant that weights for later layers in an LSTM would use sub-optimal scales, impacting accuracy. The problem gets worse as the number of layers increases.
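A minimal sketch of a setup the fix affects (my own example; layer sizes are arbitrary):

```
import torch
import torch.nn as nn

# With num_layers > 1, each layer's weights should get their own observer
# statistics rather than sharing one observer across all layers.
model = nn.LSTM(input_size=16, hidden_size=32, num_layers=3)
qmodel = torch.quantization.quantize_dynamic(model, {nn.LSTM}, dtype=torch.qint8)
```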
ghstack-source-id: 103511725

Test Plan: Will be updated

Differential Revision: D20842145

fbshipit-source-id: a622b012d393e0755970531583950b44f1964413
2020-05-05 16:40:16 -07:00
a2fc7f787a Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count
Test Plan: revert-hammer

Differential Revision:
D21171334

Original commit changeset: 37329a379de9

fbshipit-source-id: 50746043acf3c76754688de0fe6f1cc12437ea2f
2020-05-05 16:36:15 -07:00
563bbeb890 fix undef CUDA_VERSION warning (#37866)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37866

make sure not to check `CUDA_VERSION` if it is not defined

Test Plan: CI green

Reviewed By: anjali411

Differential Revision: D21408844

fbshipit-source-id: 5a9afe372b3f1fbaf08a7c43fa3e0e654a569d5f
2020-05-05 16:31:24 -07:00
0cae718723 reenable quantization test_qadd_scalar_relu test (#37423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37423

For now, see what breaks on CI
ghstack-source-id: 103508233

Test Plan:
CI

Imported from OSS

Differential Revision: D21310335

fbshipit-source-id: 99d22e61168fcb318b18a16522aabdc0115c1f39
2020-05-05 16:10:42 -07:00
b61fda2313 reenable quantized test_compare_tensor_scalar (#37422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37422

The test was failing because in fbcode the version of hypothesis was too old to know
about the width parameter, and it was trying to generate values outside float32's
representable range.  The fix is to explicitly set the bounds of the floats range
for old versions of hypothesis.
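
Roughly, the workaround looks like this (a sketch, not the exact diff; the bounds approximate float32's limits):

```
from hypothesis import strategies as st

# Old hypothesis versions ignore width=32, so pin the range explicitly to
# keep generated values representable in float32.
floats32 = st.floats(min_value=-3.4e38, max_value=3.4e38, allow_nan=False)
```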

For now, reenable the test and see what breaks in CI
ghstack-source-id: 103500358

Test Plan:
CI

```
buck test mode/dev-nosan //caffe2/test:quantization -- 'test_compare_tensor_scalar \(quantization\.test_quantized_op\.TestComparatorOps\)'
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D21310336

fbshipit-source-id: 1a59ab722daa28aab3d6d2d09bc527874942dc36
2020-05-05 16:09:08 -07:00
b57d82fcbb workaround nvcc host function bug (#37867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37867

This works around an internal issue we are hitting with nvcc in ovrsource.
It does not seem to resolve the overloads to the correct device versions of `isinf` and `isnan` without this fudging of the code.

Test Plan:
CI green,
internal builds pass

Reviewed By: malfet

Differential Revision: D21408263

fbshipit-source-id: 1ff44e088b5c885d729cc95f00cf8fa07e525f6d
2020-05-05 15:31:34 -07:00
30a65f1afa [Tensorpipe Agent] Call Shutdown from Destructor and Join (#37839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37839

Calling `RpcAgent::shutdown` from the TensorpipeAgent will ensure that parent class threads are joined and the atomic is set to False.
ghstack-source-id: 103496383

Test Plan: CI Build - no Tensorpipe Agent tests yet

Differential Revision: D21291974

fbshipit-source-id: 50cab929b021faf7f80e0e8139d0c7d1788a3a6c
2020-05-05 15:25:45 -07:00
5325606c37 Add zero_mask() for Vec256<BFloat16> (#37114)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37114

Test Plan: Imported from OSS

Differential Revision: D21351861

Pulled By: VitalyFedyunin

fbshipit-source-id: 4564624cb33555a3f026af25540b2df24edaecfb
2020-05-05 15:14:42 -07:00
4c009c7f3e Make aten_tensor_iterator ASAN safe (#37869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37869

Return type of `cpu_serial_kernel` functor should match type of the Tensor
Closes https://github.com/pytorch/pytorch/issues/37490

Test Plan: CI

Differential Revision: D21410450

fbshipit-source-id: 78081d7478fc8126cbd497625ba60ed17e253314
2020-05-05 15:08:48 -07:00
27fc2ab9f4 [TensorExpr] Add a constructor accepting a name_hint to class Buf. (#36617)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36617

Test Plan: Imported from OSS

Differential Revision: D21027355

Pulled By: ZolotukhinM

fbshipit-source-id: 54633f7400f24f7f9fdcaeead94c80282ccb5207
2020-05-05 15:06:10 -07:00
1c0bad25f3 [TensorExpr] Add dtype to class Buf. (#36611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36611

Currently Buf represents underlying storage but it didn't have dtype.
That resulted in specifying dtypes in different places and there was no
mechanism to enforce its consistency: e.g. one could've created a kFloat
expression and use a kInt buffer to store its result. Now we're
centralizing where the logic regarding the storage is located and we can
start enforcing semantics rules.

Follow-ups: we can merge Buffer and BufHandle classes as the former is
now a mere wrapper over the latter.

Test Plan: Imported from OSS

Differential Revision: D21027356

Pulled By: ZolotukhinM

fbshipit-source-id: c06aa2c4077fdcde3bb4ca622d324aece79b5a9c
2020-05-05 15:04:37 -07:00
2c6aed0d61 [Testing] Add --save-xml option (#37840)
Summary:
Passing the `--save-xml` option to the common test runner has the same effect as setting the `IN_CIRCLECI` environment variable, but also allows one to specify the folder in which to save the results.
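For example, a hypothetical invocation (the exact argument form may differ):

```
python test/test_nn.py --save-xml test-reports
```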
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37840

Differential Revision: D21410250

Pulled By: malfet

fbshipit-source-id: ae5855fafdc8c66b550d42b683d547c88b4e55d9
2020-05-05 14:57:50 -07:00
a3639fa516 [Tensorpipe Agent] Adding Tensorpipe Codeowners (#37854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37854

Adding Tensorpipe contributors to the Codeowners file for Tensorpipe-related functionality in PyTorch.
ghstack-source-id: 103507371

Test Plan: CI

Differential Revision: D21408676

fbshipit-source-id: ea7cc1fd7ec069c83e67812e704d31492ef2a3cf
2020-05-05 14:27:42 -07:00
3706803b60 Change StorageImpl to track byte count rather than element count (#37776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37776

* Remove type-specific size tracking in favor of byte size tracking in Storage and StorageImpl
* Changed numel() and set_numel() to nbytes() and set_nbytes()
* Added enum argument to Storage/StorageImpl constructor to indicate new meaning of the size parameter
* Update all callers of the changed API

Part of issue https://github.com/pytorch/pytorch/issues/33950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37028

Differential Revision: D21171334

Pulled By: ezyang

fbshipit-source-id: 37329a379de9a3a83cc5e9007e455a3e1c2d10b8
2020-05-05 14:20:51 -07:00
25ba802ce4 Fix cdist backward calculation for p=2 (#37337)
Summary:
Closes https://github.com/pytorch/pytorch/issues/37154

Fixes a bug in `cdist` backward with `p=2`.
Under some circumstances, if the output has zeros, the gradient calculation of `sqrt` is undefined, leading to NaNs in the input gradients.

This PR defines a subgradient for this case.

A test is also added to verify this behavior. I was only able to reproduce it under certain shapes, so the shape is taken directly from the example in https://github.com/pytorch/pytorch/issues/37154.
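A minimal repro sketch of the failure mode (my own construction; the original report needed specific shapes):

```
import torch

a = torch.randn(4, 3, requires_grad=True)
b = a.detach().clone()  # guarantees zero distances between matching rows
torch.cdist(a, b, p=2).sum().backward()
print(torch.isnan(a.grad).any())  # False once the subgradient is defined
```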
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37337

Differential Revision: D21403178

Pulled By: albanD

fbshipit-source-id: deef9678c1958524b552504920f19617f9ad1da6
2020-05-05 14:13:37 -07:00
06e1b68843 [BE] Add @skipIfNoFBGEMM decorator (#37810)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37810

Differential Revision: D21406034

Pulled By: malfet

fbshipit-source-id: 9583a8a726c2e59e5173e114604e4edd979330c0
2020-05-05 14:00:52 -07:00
65291fd422 Remove unused capture in tensorpipe_agent.cpp (#37828)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37828

Test Plan: Imported from OSS

Differential Revision: D21407942

Pulled By: mrshenli

fbshipit-source-id: 72c0cc36d6aa61d48c9108850f5e8ba1eb6a7507
2020-05-05 13:50:08 -07:00
bd220b336b [jit] fix trace checking reporting divergent names (#37842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37842

Fixes https://github.com/pytorch/pytorch/issues/23993.

Previously our name lookup function for the tracer was looking in
f.globals for names. For example:
```
sample1 = torch.ones(1)
sample2 = torch.ones(1)
traced = torch.jit.trace(my_mod, ((sample1, sample2,),))
> produces a graph with something like:
> %sample1, %sample2 = prim::TupleUnpack(%input)
```

This is not great if you are, e.g., trace checking, because a non-local
bit of interpreter state affects the graph produced:
```
traced = torch.jit.trace(my_mod, _clone_inputs((sample, sample,),))
> produces a graph with something like
> %0, %1 = prim::TupleUnpack(%input)
```
I have removed this functionality, as I don't think it provides huge
value. Things that look locally for names will still work, so e.g.
inputs, intermediate variables, and the like will be named correctly.

Test Plan: Imported from OSS

Differential Revision: D21406478

Pulled By: suo

fbshipit-source-id: 3c7066b95d4a6e9b528888309954b02dadbc1a07
2020-05-05 13:39:41 -07:00
9d7a79ac27 [Caffe2] raise exceptions instead of str (#37744)
Summary:
Some exceptions are not correctly wrapped inside a class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37744

Differential Revision: D21388197

Pulled By: mrshenli

fbshipit-source-id: 2d69e2543c2e05116c367d137968b982c254d2dc
2020-05-05 13:34:33 -07:00
b57c8b720e [wip] Make quantization modules work with DataParallel (#37032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37032

DataParallel requires all params and buffers of child modules to be updated
in place because of how it implements model replication during the
forward pass (see https://github.com/pytorch/pytorch/pull/12671 for
context). Any params or buffers not updated in place are lost and not
propagated back to the master.

This diff updates some quantized modules (TBD: all quantized modules? determine a good cut
point) to do their parameter updates in-place. This will enable static
quant and QAT to work correctly with DataParallel.
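
A hedged sketch of the in-place constraint (my own example; the buffer name is made up):

```
import torch

class MinObserver(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("min_val", torch.tensor(float("inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # In-place update: survives DataParallel's replicate/gather cycle.
        self.min_val.copy_(torch.min(self.min_val, x.min()))
        # Reassigning self.min_val = ... instead would be lost on replicas.
        return x
```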

TODO: https://github.com/pytorch/pytorch/pull/32684 needs to land before we can fix the graph mode test failures on this PR.

Test Plan:
script failed before and passes after the diff:
https://gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40

TODO before land: add integration testing

Imported from OSS

Differential Revision: D21206454

fbshipit-source-id: df6b4b04d0ae0f7ef582c82d81418163019e96f7
2020-05-05 13:06:43 -07:00
25e6129c52 quant BN tests: remove qint32 (#37832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37832

These tests were flaky, and qint32 support is not a priority at
the moment, so we are turning it off to improve test quality.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm2d
python test/test_quantization.py TestQuantizedOps.test_batch_norm2d_relu
python test/test_quantization.py TestQuantizedOps.test_batch_norm3d
```

Imported from OSS

Differential Revision: D21404980

fbshipit-source-id: 04f4308bc5d6e1a278c60985971d03c10a851915
2020-05-05 12:22:20 -07:00
08304ccccc add a cuda job for profiling tests (#37812)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37812

Reviewed By: ZolotukhinM

Differential Revision: D21405933

Pulled By: Krovatkin

fbshipit-source-id: 2ba67afb80a6b34373559ccd66450fce1d3140eb
2020-05-05 12:10:35 -07:00
5b0244ee8f Tighten error checking in ConcreteModuleType (#37813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37813

This condition should never fire.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D21398021

Pulled By: suo

fbshipit-source-id: 7f2213a020071b8eab80ef40ac6a9de669722548
2020-05-05 11:39:50 -07:00
782b53b654 Specify _th_ ops in CUDAUnaryOps macros so they are easier to find. (#37582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37582

Test Plan: Imported from OSS

Differential Revision: D21328055

Pulled By: gchanan

fbshipit-source-id: de0939dfdb97ab4dca777e0784fc6225ac31abdc
2020-05-05 11:38:24 -07:00
9b3911c073 [quant][graphmode][refactor] rename SwapDequant and refactor code handling general ops (#37555)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37555

Test Plan:
.

Imported from OSS

Differential Revision: D21393514

fbshipit-source-id: 5bc9fa0f0be25f4c35a64acb23513f64ed07e230
2020-05-05 11:20:15 -07:00
7fa968b10d [TensorExpr] Add python bindings for TE fuser. (#37831)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37831

Test Plan: Imported from OSS

Reviewed By: jackm321

Differential Revision: D21404947

Pulled By: ZolotukhinM

fbshipit-source-id: 8467346d4fd8413985a33832fb3994d3ead746dc
2020-05-05 10:58:30 -07:00
5c628ddbd0 Fix README for installation from source (#37301)
Summary:
I think this helps compile PyTorch from source faster and without errors about an incompatible compiler (such as: "unsupported GNU version! gcc versions later than 8 are not supported!").
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37301

Differential Revision: D21396682

Pulled By: ngimel

fbshipit-source-id: 5e21c36ee550424e820f3aa6e6131ca858994ae4
2020-05-05 10:15:21 -07:00
3b97723f08 Let >> and << support half on CUDA (#37670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37670

Differential Revision: D21395325

Pulled By: ngimel

fbshipit-source-id: fcb02f3bee488717cdc1ffc05204970b907d3c3f
2020-05-05 10:10:37 -07:00
3673a7245d graph mode: more in-place activation handling (#37771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37771

Adds in place handling for other activations in graph mode

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_swap_dequantize
```

Imported from OSS

Differential Revision: D21382825

fbshipit-source-id: 6a4e64bae08fcbfb9bdab92aaac43da98207a1c3
2020-05-05 10:07:50 -07:00
b354700e75 graph mode: round out relu support (#37592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37592

Makes sure that all the standalone relu flavors are tested in
graph mode.

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_swap_dequantize
```

Imported from OSS

Differential Revision: D21366597

fbshipit-source-id: 103848b76a0c65b9adac5bae98b545aa1d30a9e2
2020-05-05 10:06:04 -07:00
0b693e9601 uninitialize output and bag_size in the fast path of EmbeddingBag to save overhead (#36681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36681

Test Plan:
Imported from OSS

Unit tests:
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_failures_cpu
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_offsets_cpu
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_no_offsets_cpu
python test/test_nn.py TestNN.test_embeddingbag_from_pretrained
python test/test_nn.py TestNN.test_embeddingbag_from_pretrained_options

Finally run: python test/test_nn.py

Reviewed By: jspark1105

Differential Revision: D21058006

Pulled By: xing-liu

fbshipit-source-id: 65b36a788839e8b722db3e295e58215b5935d6e8
2020-05-05 09:56:52 -07:00
145560f499 Migrate erf and erf_ from the TH to Aten (CUDA) : Closes #24558 (#36724)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24558
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.erf(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.erf(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.erf(a) a.numel() == 10000 for 20000 times torch.half
0.29057903600187274
torch.erf(a) a.numel() == 10000 for 20000 times torch.float
0.2836507789979805
torch.erf(a) a.numel() == 10000 for 20000 times torch.double
0.44974555500084534
torch.erf(a) a.numel() == 100000 for 20000 times torch.half
0.31807255600142526
torch.erf(a) a.numel() == 100000 for 20000 times torch.float
0.3216503109979385
torch.erf(a) a.numel() == 100000 for 20000 times torch.double
2.0413486910001666
```

After:

```
torch.erf(a) a.numel() == 10000 for 20000 times torch.half
0.2867302739996376
torch.erf(a) a.numel() == 10000 for 20000 times torch.float
0.28851128199858067
torch.erf(a) a.numel() == 10000 for 20000 times torch.double
0.4592030350013374
torch.erf(a) a.numel() == 100000 for 20000 times torch.half
0.28704102400115517
torch.erf(a) a.numel() == 100000 for 20000 times torch.float
0.29036039400125446
torch.erf(a) a.numel() == 100000 for 20000 times torch.double
2.04035638699861
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36724

Differential Revision: D21164626

Pulled By: VitalyFedyunin

fbshipit-source-id: e6f3390b2bbb6e8d21e18ffe15f5d49a170fae83
2020-05-05 09:22:54 -07:00
23d0441da7 [JIT] Fix GetAttr inconsistency (#37424)
Summary:
We were previously only looking at class attributes, which didn't include methods etc., and would silently give wrong semantics. This makes hasAttr go through the same resolution as our other attribute lookups.
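
A minimal sketch of the kind of case affected (my own illustration, assuming `hasattr` is scriptable here; not code from the PR):

```python
import torch

class M(torch.nn.Module):
    def helper(self) -> int:  # a method, not a class attribute
        return 1

    def forward(self, x):
        # A class-attribute-only lookup would have missed `helper`.
        if hasattr(self, "helper"):
            return x + self.helper()
        return x

m = torch.jit.script(M())
print(m(torch.zeros(1)))  # tensor([1.])
```
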
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37424

Differential Revision: D21282633

Pulled By: eellison

fbshipit-source-id: 8e970f365c2740d137a02331739c2ed93747b918
2020-05-05 09:06:51 -07:00
12e64916b3 Migrate clamp from the TH to Aten (CUDA) (#37646)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/24544

Reference https://github.com/pytorch/pytorch/issues/24507
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37646

Differential Revision: D21395824

Pulled By: VitalyFedyunin

fbshipit-source-id: 111889023d60e3361b5a646bcfb6fb7d5ec969d1
2020-05-05 08:59:52 -07:00
468a9d448e [aten] Pass std::function<> to thread_pool by value, instead of const ref. (#37681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37681

By passing by value, we can std::move, and avoid unnecessarily copying
args that are part of any std::function/lambda state (e.g. in the jit
interpreter, there is a std::vector<> stack passed in the
InterpreterContinuation)

This makes the api also consistent with e.g. folly and best practices.
Added a minor at::launch() benchmark to test/cpp/, the difference is
mostly noticeable when copying the std::function<> internal args is
non-trivial.

Benchmarks pre/post (min over ~5 runs)
NoData: 5.81 us -> 5.63 us (-3.2%)
WithData(0): 6.67 us -> 5.88 us (-11.8%)
WithData(4): 6.98 us -> 6.51 us (-6.7%)
WithData(256): 9.44 us -> 7.89 (-16.5%)

ghstack-source-id: 103322321

Test Plan:
- perf: buck run mode/opt caffe2/test/cpp/api:parallel_benchmark pre/post
  - correctness buck test mode/dev-nosan caffe2/test/...

Reviewed By: dzhulgakov

Differential Revision: D21355148

fbshipit-source-id: 3567e730845106f1991091e4a892d093e00571c3
2020-05-05 08:41:38 -07:00
d7ccb4b392 Migrate CUDA unary complex kernel to c10::complex (#37647)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37647

Differential Revision: D21351018

Pulled By: anjali411

fbshipit-source-id: 51e4a4a3bdc9b3f8b9f7a5e0d65c06f209c55401
2020-05-05 08:02:00 -07:00
51c9444274 Enable test_distributed test test_backend_full_group (#37794)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37794

Differential Revision: D21392441

Pulled By: mrshenli

fbshipit-source-id: 5621b9341a676b695244790ba125d08491a3fe6f
2020-05-05 07:56:57 -07:00
7c2944899b Add vec256 for c10::complex (#37690)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37690

Test Plan: Imported from OSS

Differential Revision: D21394694

Pulled By: anjali411

fbshipit-source-id: 4f0e68280e9c9faf398cfb2d213ecdc4f5cde9fb
2020-05-05 07:18:57 -07:00
6133be31bd Fix for hooks with no name (#37785)
Summary:
Fix https://github.com/pytorch/pytorch/issues/37672

Make sure we only access fields that exist and handle python errors correctly.

Before the fix, the given test would throw:
```
AttributeError: 'MyHookClass' object has no attribute '__name__'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_autograd.py", line 432, in test_hook_with_no_name
    x.sum().backward()
  File "/Users/albandes/workspace/pytorch_dev/torch/tensor.py", line 184, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/Users/albandes/workspace/pytorch_dev/torch/autograd/__init__.py", line 115, in backward
    allow_unreachable=True)  # allow_unreachable flag
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x112fd8100> returned a result with an error set
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37785

Differential Revision: D21387946

Pulled By: albanD

fbshipit-source-id: dcb9afa37b3e10620dc9182d8aa410e7130ffb64
2020-05-05 07:14:35 -07:00
16c7907ad0 Migrate CUDA fill kernel to c10::complex (#37651)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37651

Test Plan: Imported from OSS

Differential Revision: D21394351

Pulled By: anjali411

fbshipit-source-id: 4fad836700ea25c184dcf4824829f85a0b1e2510
2020-05-05 06:55:30 -07:00
d4edbbd396 Revert D21369541: Make a separate cmake option for caffe2 tests
Test Plan: revert-hammer

Differential Revision:
D21369541

Original commit changeset: 669cff70c5b5

fbshipit-source-id: 500d261eaf3f02bcd698d343480b9e951e2844b9
2020-05-05 06:30:52 -07:00
0549e1f384 [Tensorpipe/RPC] tensorpipe RPC agent (#35483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35483

Implement the initial version of TensorPipe RPC agent, and register to RPC registry to expose to Python interface. As a starter, it utilizes all available TensorPipe transports (shm, uv) and channels (basic, cma).

Test Plan:
https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/experimental/jiayisuse/tensorpipe_rpc
  export MASTER_ADDR=127.0.0.1
  export MASTER_PORT=28500
  buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:main
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/main.par
  buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:benchmark
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/benchmark.par

Multiple connections with async echo
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/async_echo.par

Reviewed By: lw

Differential Revision: D20088366

fbshipit-source-id: 980f641af3321ca93583c62753e1c9174b7d4afc
2020-05-05 05:47:43 -07:00
3411ec6e32 [TensorPipe/RPC] Serialize and deserialize message (#36197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36197

Create APIs to convert between rpc::message and tensorpipe::message
1. tensorpipeSerialize() - converts rpc::message to tensorpipe::message without memory copy (tensors).
2. tensorpipeAllocateMessage - allocates rpc::message based on received tensorpipe descriptor to prepare memory-copy-free receiving.

Test Plan: buck test caffe2/test/cpp/rpc:test_tensorpipe_serialization

Reviewed By: lw

Differential Revision: D20084125

fbshipit-source-id: ffbc310f93443e50261aed752be0fe176610dd2a
2020-05-05 05:45:57 -07:00
7fa897eac0 [caffe2] L2 regularization for (RowWise)SparseAdagrad fusion on GPUs (#37805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37805

Resolve the unit test failures after  https://github.com/pytorch/pytorch/pull/37653

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)'
```

```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_weighted_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)'
```

Reviewed By: jspark1105

Differential Revision: D21395764

fbshipit-source-id: e8224a1ecbff5dce42ab732c0977de352fe98914
2020-05-05 00:05:32 -07:00
429d90f648 [BE] Split pytorch_linux_test into 3 steps (#37808)
Summary:
First one is to download build artifacts
Second is to run tests
Third is to upload test metadata (runs always, even if `Run` step has failed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37808

Differential Revision: D21398398

Pulled By: malfet

fbshipit-source-id: da23c499a84136e12e88adcc60206ea26bc843c9
2020-05-04 23:48:23 -07:00
458134f021 Add several ops for portal NLU/ASR model (again) (#37801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37801

D21215050  was reverted. Re do it.

Test Plan: build, CI

Reviewed By: iseeyuan

Differential Revision: D21393474

fbshipit-source-id: 2e86d5d1980a122a847e146dc6357627ec31d80d
2020-05-04 23:38:04 -07:00
aff92ef3d6 Make a separate cmake option for caffe2 tests (#37721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37721

Even though we disabled caffe2 test configs in Python, the BUILD_TEST
option was still building caffe2 test cpp binaries and various CI
configurations were running them (since they just run every binary in
`torch/test`).

This PR adds a caffe2-specific BUILD_TEST option (BUILD_CAFFE2_TEST),
which defaults to OFF, and gates the compilation of caffe2 test cpp
binaries under it.

Test Plan: Imported from OSS

Differential Revision: D21369541

Pulled By: suo

fbshipit-source-id: 669cff70c5b53f016e8e016bcb3a99bf3617e1f9
2020-05-04 23:26:27 -07:00
faad00a290 add qnnpack path for hardtanh (#35779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35779

Adds a QNNPack path for the clamp kernel, which is useful for
hardtanh.

Test Plan:
python test/test_quantized.py TestQNNPackOps.test_hardtanh

Imported from OSS

Differential Revision: D20778588

fbshipit-source-id: 537de42e795a9c67924e1acb1d33b370beb9dbf5
2020-05-04 21:58:11 -07:00
f5e6f39e00 Remove std::complex to std::complex casting specialization (#37574)
Summary:
This is no longer needed because cuda copy kernel now uses `c10::complex`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37574

Differential Revision: D21328501

Pulled By: ngimel

fbshipit-source-id: dd5226e8b6c54915fb6ee52240a446f0ca30a800
2020-05-04 21:50:10 -07:00
15df33f797 [Onnxifi] Cache output shape inference result for OnnxifiOp (#37796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37796

Shape inference is costly. In bad cases, if we have a lot of uneven tails, we are going to do a significant amount of shape inference. This diff will enable each Onnxifi operator to cache the shape inference result for a given batch size. In the worst case, we will occupy `num_inference_threads * max_batch_size` OutputReshapeInfo objects per model, where `num_inference_threads` and `max_batch_size` are smaller than 64.

Reviewed By: benjibc

Differential Revision: D21389946

fbshipit-source-id: 23473e64c338d64d15c70292cca0056205d980eb
2020-05-04 21:27:28 -07:00
1845545075 Enable HgemmBatched for ROCm (#37483)
Summary:
The purpose of this PR is to enable HgemmBatched for ROCm. Due to an inconsistency between CUDA_VERSION and HIP_VERSION, THCudaBlas_HgemmStridedBatched() was not being called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37483

Differential Revision: D21395699

Pulled By: ngimel

fbshipit-source-id: c5c22d5f2041d4c9911558b2568fc9ce33ddeb5d
2020-05-04 20:51:27 -07:00
4a2c642e1f fix ROCm bench CI by increasing first iter timeout (#37633)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37633

Differential Revision: D21395519

Pulled By: ngimel

fbshipit-source-id: 03b31417dde0758db6c189c21b6cb5771c776115
2020-05-04 20:49:32 -07:00
090ea775c9 Math functions of c10::complex should be overloaded as const reference (#37689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37689

It has to be this way, otherwise, we will not be able to use it in vec256 because the function pointers declared there are using const reference.

Test Plan: Imported from OSS

Differential Revision: D21394603

Pulled By: anjali411

fbshipit-source-id: daa075b86daaa694489c883d79950a41d6e996ba
2020-05-04 19:59:28 -07:00
8e5f162b4c [FakeLowp] Reset workspace in test (#37799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37799

Failure to do so results in some workspace contention.

Test Plan: unittest

Reviewed By: amylittleyang

Differential Revision: D21390900

fbshipit-source-id: 9e837f0f7aae32230740604069308f35b73612b9
2020-05-04 19:41:23 -07:00
1d43d7caa2 Use gpu_kernel in Affine Quantizer (#37312)
Summary:
Removes `CUDA_tensor_apply2` from Affine Quantizer.

cc: zasdfgbnm
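
For context, a minimal sketch of invoking the profiled op (my own illustration):

```python
import torch

# Quantize a float tensor on CUDA; this is the op exercised by the
# nvprof ranges below.
x = torch.randn(10_000, device="cuda")
q = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)
torch.cuda.synchronize()
```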

# Profiling

## This PR

### quint8

```==4458==       Range "quantize_per_tensor, seq = 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  4.8703ms        20  243.52us  207.60us  312.66us  quantize_per_tensor, seq = 0
 GPU activities:  100.00%  751.95us        10  75.194us  74.372us  79.044us  _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE0_clEvEUlfN3c106quint8EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
      API calls:  100.00%  162.48us        10  16.247us  13.383us  35.997us  cudaLaunchKernel
```

### qint8

```==14289==       Range "quantize_per_tensor, seq = 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  4.8143ms        20  240.71us  155.68us  327.78us  quantize_per_tensor, seq = 0
 GPU activities:  100.00%  748.85us        10  74.884us  73.892us  78.565us  _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE_clEvEUlfN3c105qint8EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
      API calls:  100.00%  166.61us        10  16.661us  13.387us  39.237us  cudaLaunchKernel
```

### qint32

```
==17303==       Range "quantize_per_tensor, seq = 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  19.011ms        20  950.55us  308.07us  1.0331ms  quantize_per_tensor, seq = 0
 GPU activities:  100.00%  1.1440ms        10  114.40us  113.42us  117.74us  _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE1_clEvEUlfN3c106qint32EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
      API calls:  100.00%  163.78us        10  16.378us  13.747us  35.668us  cudaLaunchKernel
```

## Original

commit: b428f454e13f6e8055124ea19c32b554017137d0

### quint8

```
==4361==       Range "quantize_per_tensor, seq = 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  5.6212ms        20  281.06us  230.17us  352.82us  quantize_per_tensor, seq = 0
 GPU activities:  100.00%  780.85us        10  78.084us  77.633us  78.561us  _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE0_clEvEUlRfRN3c106quint8EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
      API calls:  100.00%  166.07us        10  16.606us  13.535us  36.578us  cudaLaunchKernel
```

### qint8

```
==12583==       Range "quantize_per_tensor, seq = 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  5.5765ms        20  278.82us  226.51us  351.23us  quantize_per_tensor, seq = 0
 GPU activities:  100.00%  783.28us        10  78.328us  77.826us  80.386us  _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE_clEvEUlRfRN3c105qint8EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
      API calls:  100.00%  161.05us        10  16.104us  13.363us  34.284us  cudaLaunchKernel
```

### qint32

```
==17267==       Range "quantize_per_tensor, seq = 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  19.815ms        20  990.77us  381.03us  1.0717ms  quantize_per_tensor, seq = 0
 GPU activities:  100.00%  1.1778ms        10  117.78us  117.51us  118.44us  _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE1_clEvEUlRfRN3c106qint32EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
      API calls:  100.00%  172.26us        10  17.226us  14.094us  37.952us  cudaLaunchKernel
```

# Environment

```shell
Collecting environment information...
PyTorch version: 1.6.0a0+010771e
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.14.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: TITAN V
Nvidia driver version: 440.33.01
cuDNN version: /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.6.0a0+010771e
[conda] blas                      1.0                         mkl
[conda] magma-cuda102             2.5.2                         1    pytorch
[conda] mkl                       2020.0                      166
[conda] mkl-include               2020.0                      166
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.15           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] torch                     1.6.0a0+010771e           dev_0    <develop>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37312

Differential Revision: D21383938

Pulled By: jerryzh168

fbshipit-source-id: 21539675267c64508a6b9eafcde1a8861d1fb421
2020-05-04 19:31:27 -07:00
847d102e93 docs: Fixed docstring indentation for documentation (#37739)
Summary:
Hello there,

I was going through the default initialization of some layers, and ended up on the `torch.nn.init` documentation. As shown below, there was a slight issue with the docstrings of both `kaiming_normal_` and `kaiming_uniform_` that yielded a wrong list of function parameters:

![doc_issue](https://user-images.githubusercontent.com/26927750/80923512-88e30400-8d84-11ea-8708-36ed3a0f7749.png)

This PR fixes the indentation in the corresponding docstrings.

Any feedback is welcome!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37739

Differential Revision: D21393728

Pulled By: ngimel

fbshipit-source-id: 64523cb328e72d2e51c2c42b20a4545c1ec5f478
2020-05-04 19:08:55 -07:00
53ca3e5b9c Migrate CUDA cat, scatter, gather, index, index_put to c10::complex (#37650)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37650

Test Plan: Imported from OSS

Differential Revision: D21394299

Pulled By: anjali411

fbshipit-source-id: dd7666af736e720b54b978cc62570d0e840e092f
2020-05-04 18:58:28 -07:00
209c6f9ab5 Move device type init from BackendSelect to backend kernels (#37402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37402

Previously, BackendSelect kernels did just-in-time device type
initialization by calling `LegacyTypeDispatch.initForDispatchKey()`
with a computed dispatch key. Here we move the initialization to
the backend kernels themselves, where we can call the device-
specific initializer directly.

Putting this up to run tests on it, but a couple questions remain:
* why were only BackendSelect kernels doing this initialization?
  Not all factory ops appear there, nor are all the ops that do
  appear there factory ops. Currently we generate init code for
  exactly the BackendSelect ops, but the choice should be better
  motivated.
* the previous scheme maps HIP to its own legacy type dispatch
  entry, but the logic assumes it's exclusive with CUDA, and no
  ops appear to mention HIP explicitly, so the new logic doesn't
  expose a static entry point for it. Needs to be verified.

Test Plan: Imported from OSS

Differential Revision: D21282974

Pulled By: bhosmer

fbshipit-source-id: cd46eb788596948e0572a15fac0f8b43feca5d75
2020-05-04 18:44:43 -07:00
0c2a72ec41 Update README to include few (missing?) links (#37714)
Summary:
Update of README
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37714

Differential Revision: D21393786

Pulled By: ngimel

fbshipit-source-id: 8ae12b38989cbfcdd4d69db1c1ab3bbac0e0db61
2020-05-04 18:34:58 -07:00
d16c8238e1 [ONNX] Fix numerical errors in softmax when dim is not last dimension (#37326)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34585.

This PR improves the workaround for the problem of different semantics between ONNX softmax and Pytorch softmax.

In PyTorch the `dim` parameter specifies over which dimension to normalize the values. ONNX, on the other hand, always coerces the input into a 2D tensor, and the `axis` parameter specifies which dimensions represent the rows and columns of the resulting tensor. As a result, the semantics are the same only when we are normalizing the last dimension (`dim == ndim - 1`).

Previously this was handled by recognizing the `dim == ndim - 1` case and using `softmax` for that. All other cases used a fallback path of explicit invocations of exp, reducesum and div operators to compute the result. Unfortunately, this results in numeric errors when input values are large: exp produces infinity in both the numerator and the denominator, and dividing them results in NaN.

This can be improved by transposing the input tensor so that we can reuse ONNX softmax.
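
A sketch of the transpose-based approach (illustrative only; not the converter code itself):

```python
import torch

def softmax_any_dim(x, dim):
    # Move `dim` to the last position so that softmax over the last
    # dimension (the case where ONNX and PyTorch semantics agree)
    # can be reused, then move it back.
    x_t = x.transpose(dim, -1)
    return torch.softmax(x_t, dim=-1).transpose(dim, -1)

x = torch.randn(2, 3, 4) * 1000  # large values break the exp/sum/div fallback
assert torch.allclose(softmax_any_dim(x, 1), torch.softmax(x, dim=1))
```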

A similar approach was applied to the `logsoftmax` function in https://github.com/pytorch/pytorch/issues/30433.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37326

Reviewed By: hl475

Differential Revision: D21389712

Pulled By: houseroad

fbshipit-source-id: 554fd1b98231a28984c30c7e7abd3c0643386ff7
2020-05-04 18:07:38 -07:00
804e32a467 split out docs tests into separate job (#37793)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37793

Test Plan: Imported from OSS

Differential Revision: D21392798

Pulled By: suo

fbshipit-source-id: 172fb0522d0b168ca19a382e5fb1eb87b6390acc
2020-05-04 17:58:04 -07:00
57dc4cd0f8 [MultiProcessTestCase] Improve the error message when a process terminates (#37627)
Summary:
When a subprocess terminates with an exception in a distributed test, log the process number as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37627

Differential Revision: D21366149

Pulled By: rohan-varma

fbshipit-source-id: 132c4b4c1eb336761c2be26d034d8b739ae19691
2020-05-04 17:46:36 -07:00
20e5749129 Migrate CPU casting copy kernel to c10::complex (#37649)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37649

Test Plan: Imported from OSS

Differential Revision: D21385430

Pulled By: anjali411

fbshipit-source-id: 66106edfc682fc75f293babece1dc4323aa3aeb1
2020-05-04 16:41:27 -07:00
0a24f33dc1 [quant][mobile] Return for conv with empty batch (#37779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37779

We should just return an empty output.

Test Plan: Imported from OSS

Differential Revision: D21385789

fbshipit-source-id: 4b42f5aaebabfa3f329ed74356bddb33daad98d5
2020-05-04 15:41:14 -07:00
4fef3763dd Revert "Revert D21337640: [pytorch][PR] Split up documentation into subpages and clean up some warnings" (#37778)
Summary:
Original PR: https://github.com/pytorch/pytorch/pull/37419

cc mattip suo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37778

Differential Revision: D21385774

Pulled By: ezyang

fbshipit-source-id: 5de532faab8bae132736b6b5189e0ee2ac9935be
2020-05-04 14:32:35 -07:00
4025d87843 Kill the ability to codegen tensor-based broadcasting. (#37547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37547

This shouldn't be used anymore.

Test Plan: Imported from OSS

Differential Revision: D21315037

Pulled By: gchanan

fbshipit-source-id: 12728f1d0e1856bf3e8fe1bfcf36cddd305a4a76
2020-05-04 13:38:56 -07:00
73aa49d529 Move addr broadcasting from codegen layer to native layer. (#37546)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37546

Test Plan: Imported from OSS

Differential Revision: D21315040

Pulled By: gchanan

fbshipit-source-id: 1bba97bd889ec286e3e7f1d0f0450871b996c9ae
2020-05-04 13:38:51 -07:00
e38d7591a7 Move broadcasting code for fmod, fmod_ from codegen layer. (#37545)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37545

Test Plan: Imported from OSS

Differential Revision: D21315036

Pulled By: gchanan

fbshipit-source-id: cbe82205dc71c2a704d717a5f82827fc6ff5106c
2020-05-04 13:37:12 -07:00
4cdaa5956c capitalize fuseTensorExpr (#37780)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37780

Differential Revision: D21386092

Pulled By: Krovatkin

fbshipit-source-id: c190f891fe25b3cee9a34b5173756c39efd49c66
2020-05-04 12:40:49 -07:00
fe8fdb775f [quant][graph] Fix bug in replicateDequant (#37637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37637

Insert dequant op at specific offset, rather than for all inputs of user

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21354931

fbshipit-source-id: 79a1dc63b0ed96c3d51d569116ed963106085d3b
2020-05-04 12:37:29 -07:00
a6aa336cc2 [quant][graph] Fix bug in replaceConvolutionWithConv2d (#37635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37635

replaceConvolutionWithConv2d incorrectly assumes that the size of padding is 2. For Conv1d it is 1, in which case we cannot replace with aten::conv2d

Test Plan: Imported from OSS

Differential Revision: D21354930

fbshipit-source-id: a2dbad856666b4bbb2d9015ade8e1704774f20dd
2020-05-04 12:35:59 -07:00
77dd00c850 Permit registration of multiple triggers, but insert warning (#37772)
Summary:
If the same file is linked multiple times, the trigger check becomes overly strict and crashes execution at startup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37772

Differential Revision: D21384072

Pulled By: bwasti

fbshipit-source-id: 3396e69cd361f65e50517970d23497804c76023e
2020-05-04 12:18:11 -07:00
a058e938f9 Refactor error msg stack handling, add TORCH_RETHROW (#37101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37101

Fixes #36954.

The basic concept is to streamline the process of rethrowing
c10::Error with extra error information.  This is in a few
steps:

- I completely remodeled the Error data type and the internal
  invariants.  Instead of manually adding in newlines, the
  message stack formatting process is responsible for inserting
  newlines and spacing as necessary.  Call sites are then
  modified to respect the new API model.
- TORCH_RETHROW macro is added, which adds context to an error
  message and then rethrows it.

New internal assert failure looks like:

```
0 INTERNAL ASSERT FAILED at ../c10/test/util/exception_test.cpp:64, please report a bug to PyTorch.
Exception raised from TestBody at ../c10/test/util/exception_test.cpp:64 (most recent call first):
frame #0: <unknown function> + 0x6aab9 (0x7ff611d3aab9 in /data/users/ezyang/pytorch-tmp/build/lib/libc10.so)
frame #1: ...
```

Error message with context looks like:

```
This is an error
  This is context 1
  This is context 2
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21202891

Pulled By: ezyang

fbshipit-source-id: 361cadd16bc52e5886dba08e79277771ada76169
2020-05-04 11:56:45 -07:00
efd8f70cac Make msg() and msg_with_backtrace() private (#37094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37094

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21202892

Pulled By: ezyang

fbshipit-source-id: d59e6bffabd90cc734056bdce2cd1fe63262fab8
2020-05-04 11:54:34 -07:00
6dd1beaaa8 To fix caffe2 model with Copy OP cannot export to onnx model (#37144)
Summary:
Fix the failure when exporting a Caffe2 model containing a Copy op to an ONNX model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37144

Reviewed By: houseroad

Differential Revision: D21252421

Pulled By: yinghai

fbshipit-source-id: 4f1077188f36b0691d199e418880bbb27f11032d
2020-05-04 11:34:09 -07:00
1bac49f075 Migrate item() to c10::complex (#37648)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37648

Test Plan: Imported from OSS

Differential Revision: D21382318

Pulled By: anjali411

fbshipit-source-id: c1d3da43f118f18739bb34906f76a5bad097c905
2020-05-04 11:15:51 -07:00
c0ff085775 [PyTorch] Modify data_parallel to work with small tensors (#37704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37704

If the input tensor cannot be chunked, run `parallel_apply` on fewer devices.
Modify the input tensor dimension in `DataParallelUsesAllAvailableCUDADevices_CUDA` to be chunkable by any number of available CUDA devices.

Test Plan: Run `test/cpp/api/parallel` on a machine with 6 GPUs

Differential Revision: D21365416

fbshipit-source-id: 60fdfed4a0e6256b2c966c2ea3e8d0bfb298d9a8
2020-05-04 11:06:42 -07:00
20f7e62b1d Revert D21337640: [pytorch][PR] Split up documentation into subpages and clean up some warnings
Test Plan: revert-hammer

Differential Revision:
D21337640

Original commit changeset: d4ad198780c3

fbshipit-source-id: fa9ba6ac542173a50bdb45bfa12f3fec0ed704fb
2020-05-04 10:57:55 -07:00
fd05debbcd [TS][easy] Typo Fix (#37773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37773

As Title says
ghstack-source-id: 103385174

Test Plan: CI

Reviewed By: dmudiger

Differential Revision: D21374951

fbshipit-source-id: a2fc48b931f0cecbc8a995bf4b4ace30a8eb0d70
2020-05-04 10:41:07 -07:00
812a3fa03d Show warning if Tensor.random_()'s from and to are not in [-(2^digits), 2^digits] bounds for floating-point types (#37537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37537

The documentation states that `random_()` samples "from the discrete uniform distribution". Floating-point types can support a _discrete uniform_ distribution only within the range [-(2^digits), 2^digits], where `digits = std::numeric_limits<fp_type>::digits`, or

- [-(2^53), 2^53] for double
- [-(2^24), 2^24] for float
- [-(2^11), 2^11] for half
- [-(2^8), 2^8] for bfloat16

The worst-case scenario is when the floating-point type cannot represent any numbers between `from` and `to`. E.g.
```
torch.empty(10, dtype=torch.float).random_(16777217, 16777218)
tensor([16777216., 16777216., 16777216., 16777216., 16777216., 16777216.,
        16777216., 16777216., 16777216., 16777216.])
```
Because 16777217 cannot be represented in float.
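
These bounds follow from the mantissa width; a quick way to recover them (my own sketch, using the identity eps == 2^(1 - digits), and assuming `torch.finfo` supports all four dtypes):

```python
import math
import torch

for dtype in (torch.double, torch.float, torch.half, torch.bfloat16):
    digits = 1 - int(math.log2(torch.finfo(dtype).eps))
    print(dtype, 2 ** digits)  # 2**53, 2**24, 2**11, 2**8
```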

Test Plan: Imported from OSS

Differential Revision: D21380387

Pulled By: pbelevich

fbshipit-source-id: 80d77a5b592fff9ab35155a63045b71dcc8db2fd
2020-05-04 10:36:04 -07:00
e6221f4ca1 Remove std::complex from TypeMeta (#37632)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37632

Differential Revision: D21362056

Pulled By: anjali411

fbshipit-source-id: b20506a36594ad8485ba8ef31d2d8a83ff0862f2
2020-05-04 10:31:34 -07:00
dbcfd62a1c Remove unnecessary pickle and unpickle invocation in PyRRef __setstate__/__getstate__ methods (#37638)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37638

Test Plan: Imported from OSS

Differential Revision: D21343280

Pulled By: mrshenli

fbshipit-source-id: da462fee5815dc74c7f2dc3161699e461bc7d7d3
2020-05-04 10:26:54 -07:00
b7f258bbd3 add fmt to libtorch_python.so (#37560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37560

Test Plan: Imported from OSS

Differential Revision: D21320059

Pulled By: suo

fbshipit-source-id: 95cfe7cf26c515fdfcb4621cc58266d838a38a3e
2020-05-04 10:14:37 -07:00
5216917022 [caffe2/dnnlowp] documentation for pack operator arguments (#37719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37719

As title

Test Plan: Just updating doc

Reviewed By: hyuen

Differential Revision: D21369227

fbshipit-source-id: a45e5d0fa34aea8046eb4bb83e6c4df4d2654252
2020-05-04 09:59:51 -07:00
aa54f58041 LoopOptions::gpu_block_index(): bool -> int (#37578)
Summary:
Small change to allow MSVC build pass.
The error is

```
D:\pytorch-scripts\caffe2_builders\v141\pytorch\torch/csrc/jit/tensorexpr/stmt.h(370): error C4805: '!=': unsafe mix
of type 'bool' and type 'int' in operation (compiling source file D:\pytorch-scripts\caffe2_builders\v141\pytorch\torch
\csrc\jit\passes\tensorexpr_fuser.cpp) [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\RelWithDebInfo\caffe2\tor
ch_cpu.vcxproj]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37578

Differential Revision: D21348964

Pulled By: ezyang

fbshipit-source-id: 2c5f995e0adbeb681c18625b59250d7ee3e958ef
2020-05-04 09:43:47 -07:00
f10fbcc820 Split up documentation into subpages and clean up some warnings (#37419)
Summary:
xref gh-32838, gh-34032

This is a major refactor of parts of the documentation to split it up using sphinx's `autosummary` feature which will build out `autofuction` and `autoclass` stub files and link to them. The end result is that the top module pages like torch.nn.rst and torch.rst are now more like table-of-contents to the actual single-class or single-function documentations pages.

Along the way, I modified many of the docstrings to eliminate sphinx warnings when building. I think the only thing I changed from a non-documentation perspective is to add names to `__all__` when adding them to `globals()` in `torch.__init__.py`

I do not know the CI system: are the documentation build artifacts available after the build, so reviewers can preview before merging?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37419

Differential Revision: D21337640

Pulled By: ezyang

fbshipit-source-id: d4ad198780c3ae7a96a9f22651e00ff2d31a0c0f
2020-05-04 09:39:22 -07:00
b1e4e4d470 Remove zero_dim_dispatch_when_scalar (#37580)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37580

Differential Revision: D21380154

Pulled By: ezyang

fbshipit-source-id: 4556c7ca6126a7d382f6343aee14a7e46c498ac3
2020-05-04 09:26:45 -07:00
5ec87a3c1a Move baddbmm broadcasting from codegen layer to native layer. (#37544)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37544

Test Plan: Imported from OSS

Differential Revision: D21315039

Pulled By: gchanan

fbshipit-source-id: aa564d06d415ad2468b4898f84edbb03a3ee698f
2020-05-04 08:42:22 -07:00
66a20c259b [CircleCI] Store build artifacts for python docs (#37658)
Summary:
This PR allows the build artifacts for python docs to be stored on CircleCI, which helps the reviewer preview doc changes before merging.

The artifacts can be found in the [`ARTIFACTS` tab](  https://app.circleci.com/pipelines/github/pytorch/pytorch/162986/workflows/a969f256-3243-414f-8a02-1234b9dac149/jobs/5320907/artifacts) of the test **pytorch_cpp_doc_push**, and the website is served at https://5320907-65600975-gh.circle-artifacts.com/0/docs/index.html

This PR is inspired by rgommers's comment under https://github.com/pytorch/pytorch/pull/37419#issuecomment-621420500

> There's a CircleCI job pytorch_python_doc_push that builds the docs, however it doesn't store any artifacts for PRs. Controlled by .circleci/scripts/python_doc_push_script.sh. I think that's the only doc build (?). Not sure why it doesn't store the artifacts, that would be useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37658

Differential Revision: D21380094

Pulled By: ezyang

fbshipit-source-id: 1dd44bf836ebc74454f4444ae9321807dccdb313
2020-05-04 07:29:58 -07:00
bcdff7eb67 Fix for tests on ROCm (#37616)
Summary:
This pull request fixes and re-enables two of the tests disabled in https://github.com/pytorch/pytorch/issues/37427
1. `test_sparse_add_out_bfloat16` in test_sparse.py fixed to use updated `atol` argument instead of `prec` for `assertEqual`
2. The conversion of `flt_min` to `int64` is divergent on HIP compared to numpy. The change removes that conversion from the `test_float_to_int_conversion_finite` test case in test_torch.py

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37616

Differential Revision: D21379876

Pulled By: ezyang

fbshipit-source-id: 2bfb41d67874383a01330c5d540ee516b3b07dcc
2020-05-04 07:16:54 -07:00
6c37ad2674 typo in MultiheadAttention documentation (#37496)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37496

Differential Revision: D21356999

Pulled By: zou3519

fbshipit-source-id: ba6c7b19053c97e5ab77eb0c44e97d26b04d85da
2020-05-04 07:05:33 -07:00
136bc5a482 Revert D21215050: Add ops for portal NLU model
Test Plan: revert-hammer

Differential Revision:
D21215050

Original commit changeset: 874023c449e4

fbshipit-source-id: 695494d9607bc8823c494fa06830370adccbf935
2020-05-04 06:46:17 -07:00
6a6c29c1c9 Update TensorPipe submodule (#37729)
Summary:
In order to include these fixes that were blocking https://github.com/pytorch/pytorch/pull/35483:
- 673eda9efc
- ff8d1733ad
- c73367836f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37729

Reviewed By: beauby

Differential Revision: D21378972

Pulled By: lw

fbshipit-source-id: 3375fe1fa6e79817da3bb033127c3c8f31c3ffc3
2020-05-04 04:44:57 -07:00
e26631b333 [caffe2] Shape inference for UnPackRecords
Summary:
Since UnPackRecords is part of the graph, we need to add shape inference for it to make it work e2e with tvm_jit_op. Because the input is packed, shape inference is impossible without shape info for the packed tensors. For context, the shape of a packed tensor is 1 x num_embeddings x embedding_size, with 1 being the batch_size. The shape of the corresponding output tensor is thus batch_size x num_embeddings x embedding_size after concatenating the packed tensors on the batch axis. Therefore two more gflags need to be added:

- caffe2_predictor_num_embeddings
- caffe2_predictor_embedding_size

These gflags are then added to the UnPackRecordsOp in the predict_net as args to pass the info to c2_frontend so TVM can do its own shape inference.
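
A small sketch of the shape relationship (illustrative only):

```python
import torch

num_embeddings, embedding_size, batch_size = 4, 8, 3
# Each packed tensor carries one example: 1 x num_embeddings x embedding_size.
packed = [torch.randn(1, num_embeddings, embedding_size) for _ in range(batch_size)]
# Unpacking concatenates on the batch axis.
out = torch.cat(packed, dim=0)
assert out.shape == (batch_size, num_embeddings, embedding_size)
```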

Reviewed By: yinghai

Differential Revision: D21286983

fbshipit-source-id: e9a19cb6b564905282a771df2b9d211d5d37dd71
2020-05-04 01:31:47 -07:00
bd9617d5af [TVM] Implement UnPackRecordsOp (#37489)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37489

Reviewed By: yinghai

Differential Revision: D21196596

fbshipit-source-id: 58b8bc3afc472e02ef7e3d31151fe8ace2be2a73
2020-05-04 01:30:06 -07:00
843c0230f2 Add ops for portal NLU model (#37192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37192

Add ops used by portal NLU model to lite interpreter

Test Plan: local test

Reviewed By: iseeyuan

Differential Revision: D21215050

fbshipit-source-id: 874023c449e4c04b9f3f871450a7cf02e8f5f5c4
2020-05-03 20:08:28 -07:00
ffed77d0c8 Updating submodules
Summary:
GitHub commits:

fb6b95d8c7

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 9ff6b68322b92710c7356030bc9e4c364a937ea3
2020-05-03 17:58:44 -07:00
506ae60547 [caffe2] L2 regularization for rowwise fused sparse adagrad (#37653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37653

Following up D21320243 adding weight_decay to rowwise fused sparse adagrad. This is more involved because we can't reuse g_sq_avg multiple times.

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D21335643

fbshipit-source-id: 491b385c5eb9c0d1e3d31a1cf50d7eb450c2d39d
2020-05-03 10:44:25 -07:00
3403d27def [caffe2] L2 regularization for fused sparse Adagrad (#37652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37652

Add weight_decay to the fused adagrad operators. This should be landed together with the next diff. Just separating it out to make review easier.

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D21320243

fbshipit-source-id: 1157471988dedd60ba9b62949055f651b1fa028f
2020-05-03 10:44:19 -07:00
8cb1f2f9dc implement L2 regularization for Adagrad in caffe2 and dper (#37705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37372

Posted note: [Regularizing SparseNN Against Over-fitting](https://fb.workplace.com/notes/taiqing-wang/regularizing-sparsenn-against-over-fitting/220306075902708/)

**Problem formulation**

L(w) = J(w) + lambda/2 * ||w||^2
J(w) is the empirical loss, and ||w||^2 is the squared L2 norm of the parameters, a.k.a. L2 regularizer.

dL(w)/ dw_i = dJ(w)/dw_i + lambda w_i
dL(w)/ dw_i is the gradient of L(w) w.r.t. w_i.

To implement the L2 regularizer, lambda * w_i is added to the gradient of J(w) w.r.t. w_i. lambda is called weight decay in this implementation.
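
A minimal sketch of the update described above (names are illustrative, not the Caffe2 operator API):

```python
import torch

def adagrad_step(w, grad, h, lr=0.01, weight_decay=0.0, epsilon=1e-8):
    grad = grad + weight_decay * w        # dL/dw_i = dJ/dw_i + lambda * w_i
    h += grad * grad                      # accumulate squared gradients
    w -= lr * grad / (h.sqrt() + epsilon)
    return w, h

w, h = torch.ones(4), torch.zeros(4)
w, h = adagrad_step(w, torch.full((4,), 0.5), h, weight_decay=0.1)
```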

**Code changes**
* In the initialization method of AdagradOptimizer, a new input argument, weight_decay, is added.
* In the _run function of AdagradOptimizer, the weight decay will be skipped for 1d bias vectors.
* In the parameter update functions of Adagrad, the gradient is updated by weight_decay * w_i. The default value for weight_decay is zero.

Test Plan:
`
buck build caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay
`

`
./buck-out/gen/caffe2/caffe2/fb/dper/layer_models/tests/split_1/sparse_nn_test_weight_decay#binary.par
`

Reviewed By: jspark1105

Differential Revision: D21258652

fbshipit-source-id: d2366ddcd736a03205a2d16f914703b16d9fce8f
2020-05-03 10:42:49 -07:00
cc0f1b22a2 [PyTorch Numeric Suite] Add module output comparison (#36701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36701

Add module output comparison API.
ghstack-source-id: 103368194

Test Plan: buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs'

Differential Revision: D21053197

fbshipit-source-id: cabcafbeeac1b604db069833a0f17ebce506ba65
2020-05-03 00:04:35 -07:00
5baa6b6c34 Add a Bazel build config for TensorPipe (#37691)
Summary:
See https://github.com/pytorch/pytorch/pull/37671 for the CI signals once the TensorPipe RPC agent is added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37691

Reviewed By: malfet

Differential Revision: D21359470

Pulled By: lw

fbshipit-source-id: 577dd6d73a4a11d67b50d8686628dc6d8b24201d
2020-05-02 01:25:06 -07:00
d639418307 Add timeout injection to faulty agent for testing (#37485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485

Adds arbitrary timeout injection to the faulty RPC agent. This is to better test scenarios involving long-running RPCs, such as properly testing RPC timeouts and the profiler in all scenarios.

This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the timeout. Determining which messages to time out is done similarly to the existing `faulty_messages`, by having the user specify a mapping of message type to timeout.

Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which was not tested before.
ghstack-source-id: 103341662

Test Plan: Added unit tests in `FaultyRpcAgentTest`.

Differential Revision: D21296537

fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896
2020-05-01 23:48:28 -07:00
707e0e86c0 [WIP] retry apt at individual package level and at command level (#37696)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37696

Differential Revision: D21367632

Pulled By: kostmo

fbshipit-source-id: 2f568ba2b404f2394875e0012fce5b930a16a9db
2020-05-01 22:08:10 -07:00
5e0a24f1f9 [quant][graphmode] Move numerics changing passes before finalize (#37514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37514

This is to constrain all numerics changing operations in insert quant dequant pass

Test Plan:
python test/test_quantization.py TestQuantizeScriptJitPasses

Imported from OSS

Differential Revision: D21364008

fbshipit-source-id: eb8774e9e4b1db8bf09560e7e4d69d28f9d954a5
2020-05-01 18:40:59 -07:00
b4d486abbc Enable test_DistributedDataParallel_SyncBatchNorm_2D_Input unit test (#33573)
Summary:
Test needs ability to toggle cuDNN/MIOpen at runtime (enabled in PR https://github.com/pytorch/pytorch/issues/33118)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33573

Differential Revision: D21360260

Pulled By: mrshenli

fbshipit-source-id: 6e26edc0932efb5d278c2ffc919979b8eb089216
2020-05-01 18:22:25 -07:00
ae755a73d3 SyncBatchNorm size check update (#37133)
Summary:
Update the requirements on input dimensions for torch.nn.SyncBatchNorm:
1. Checks the aggregated batch size `count_all` instead of the batch size in every DDP process https://github.com/pytorch/pytorch/issues/36865
2. Added test function for SyncBatchNorm where every process only has 1 input
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37133

Differential Revision: D21331120

Pulled By: zhaojuanmao

fbshipit-source-id: ef3d1937990006609cfe4a68a64d90276c5085f2
2020-05-01 18:01:30 -07:00
564de515f5 Add an iterator to Block. (#37542)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37542

Differential Revision: D21314421

Pulled By: resistor

fbshipit-source-id: e54d7a8a5c9c1186be59f69b5b8af030fc054b32
2020-05-01 15:12:49 -07:00
fbf110293d jit/OVERVIEW.md: screen * in 'Node*' for proper rendering. (#37686)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37686

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D21358819

Pulled By: ZolotukhinM

fbshipit-source-id: 6425786f3b19d6b3d51c8d5386c3ab31d4344959
2020-05-01 14:44:37 -07:00
099a84ef9b Add overload name for aten::tensor and aten::as_tensor (#37655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37655

Add overload names for aten::tensor and aten::as_tensor.
These two ops are used in the NLU model, and they will be included in the lite interpreter.

Test Plan: verified model can be loaded correctly

Reviewed By: iseeyuan

Differential Revision: D21346142

fbshipit-source-id: 05ff4d9e0bcf7f4f9a30d95ca81aef9c3f6b0990
2020-05-01 14:31:04 -07:00
fa8ab4b80c [pt][quant] Unify numerics between fakequant and quant/dequant (#37188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37188

Add zero point after rounding in both fakequant and quant.
ghstack-source-id: 103231624
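
A sketch of the unified order of operations (my reading of the summary, not the kernel code):

```python
import torch

def quantize_ref(x, scale, zero_point, qmin=0, qmax=255):
    # Zero point is added AFTER rounding, in both quant and fakequant.
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, qmin, qmax)

def fake_quantize_ref(x, scale, zero_point, qmin=0, qmax=255):
    q = quantize_ref(x, scale, zero_point, qmin, qmax)
    return (q - zero_point) * scale  # dequantize back to float
```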

Test Plan:
buck test //caffe2/test:quantization -- --print-passing-details
```
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124675094587
Summary (total time 186.50s):
  PASS: 191
  FAIL: 0
  SKIP: 20
    caffe2/test:quantization - test_numerical_consistency_per_tensor (quantization.test_fake_quant.TestFakeQuantizePerTensor)
    caffe2/test:quantization - test_numerical_consistency_per_channel (quantization.test_fake_quant.TestFakeQuantizePerChannel)
    caffe2/test:quantization - test_backward_per_tensor (quantization.test_fake_quant.TestFakeQuantizePerTensor)
    caffe2/test:quantization - test_qadd_scalar_relu (quantization.test_quantized.TestQuantizedOps)
    caffe2/test:quantization - test_mean (quantization.test_quantized.TestQNNPackOps)
    caffe2/test:quantization - test_qnnpack_maxpool2d (quantization.test_quantized.TestQNNPackOps)
    caffe2/test:quantization - test_qhardsigmoid (quantization.test_quantized.TestQNNPackOps)
    caffe2/test:quantization - test_batch_norm3d (quantization.test_quantized.TestQuantizedOps)
    caffe2/test:quantization - test_hardswish (quantization.test_quantized.TestQNNPackOps)
    caffe2/test:quantization - test_qnnpack_sigmoid_sweep (quantization.test_quantized.TestQNNPackOps)
    ...and 10 more not shown...
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Reviewed By: jspark1105

Differential Revision: D21193552

fbshipit-source-id: f63c072d772f459ca6f0f2132aa836b2714fced1
2020-05-01 14:01:04 -07:00
95465dcbaf autograd: move scalar input to a different device when needed (#35286)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33870
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35286

Differential Revision: D21229721

Pulled By: albanD

fbshipit-source-id: 6f6a6d44b675457c9580ec2d91da52d12d44f096
2020-05-01 13:56:29 -07:00
cyy
2658bae570 use std::move (#34365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34365

Differential Revision: D21349942

Pulled By: mrshenli

fbshipit-source-id: 4deb51cbb557501b43990ec7080c71a839cb5db9
2020-05-01 13:42:23 -07:00
b1790794f6 Enforce Tensor.random_ check that from and to are in tensor dtype bounds (#37507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37507

Replace `TORCH_WARN` with `TORCH_CHECK` if `Tensor.random_()`'s `from` or `to-1` is out of bounds for the tensor's dtype. Previously the warning said "This warning will become an error in version 1.6 release, please fix the code in advance", so the time has come.
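
A sketch of the new behavior (the exact error text is an assumption):

```python
import torch

t = torch.empty(3, dtype=torch.uint8)  # representable range is [0, 255]
try:
    t.random_(0, 512)  # to - 1 == 511 is out of bounds for uint8
except RuntimeError as e:
    print(e)  # previously this was only a warning
```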

Related to #33106

Test Plan: Imported from OSS

Differential Revision: D21349413

Pulled By: pbelevich

fbshipit-source-id: ac7c196a48fc58634611e427e65429a948119e40
2020-05-01 12:58:45 -07:00
831c8f362f fix the incorrect merge of profiling information of two tensor types for the same value (#36806)
Summary:
As part of moving to dynamic shapes, we are now passing `frame_id` to each profiling callback. The implementation of that requires copying profiling callbacks into the Interpreter, so the `first`s are actually different for every run. The dynamic-shapes merging algorithm won't be using `first`, but in the meantime, until we get there, this should be a good enough fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36806

Differential Revision: D21307173

Pulled By: Krovatkin

fbshipit-source-id: 7dade56ebcc72ebd40bb7f3d636c7b83c99b628f
2020-05-01 12:53:25 -07:00
b410d03e6e Back out "[c2][opt] nomnigraph transform for ClipRangesGatherSigridHash fusion" (#37675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37675

Original commit changeset: 2c2481e3d497

(Note: this ignores all push blocking failures!)

Test Plan: Back out D21262085 due to ASAN crash P130123493

Differential Revision: D21353550

fbshipit-source-id: c43c8764322f7e58aca0c1360b1d03966b1d9798
2020-05-01 12:49:17 -07:00
ba7461c135 Add pointer to RPC parameter server tutorial (#37667)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37667

Test Plan: Imported from OSS

Differential Revision: D21351052

Pulled By: mrshenli

fbshipit-source-id: 8c3f78215f40b5641983f1aea4ac92152a9c136a
2020-05-01 12:18:45 -07:00
49c8a37a0d Fix doc-gen warnings in RPC (#37666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37666

Add `:orphan:` to avoid "WARNING: document isn't included in any toctree".

Test Plan: Imported from OSS

Differential Revision: D21351053

Pulled By: mrshenli

fbshipit-source-id: 6ff67c418fc1de410c7dc39ad9a0be5c30d07122
2020-05-01 12:17:15 -07:00
ba5137ea9d [pyper] Use Caffe2 ops
Summary: Replace inefficient python code w/ calls to Caffe2 operators

Test Plan: existing unit tests for modified operators

Reviewed By: alyssawangqq

Differential Revision: D21270962

fbshipit-source-id: cb11133be4eff80a24d1358fd7bb7d354075dd8b
2020-05-01 12:06:52 -07:00
675b3fc834 Prevent unbounded growth of sparse tensor in add operation (#36030)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34964

Sparse cuda add was implemented by just concatenating the indices and values for the tensor. If called repeatedly in a tight loop this will let `nnz` grow unbounded. In the worst case of  `x.add_(x)` it grows exponentially.
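
A sketch of the failure mode (illustrative; the nnz values are what the summary implies):

```python
import torch

i = torch.tensor([[0], [0]])
v = torch.tensor([1.0])
x = torch.sparse_coo_tensor(i, v, (2, 2), device="cuda")
for _ in range(40):
    x.add_(x)  # previously concatenated indices/values, doubling nnz each call
print(x._nnz())  # bounded after the fix, instead of 2**40
```
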
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36030

Differential Revision: D20873504

Pulled By: zou3519

fbshipit-source-id: d90ed8dda0c89571fb89e358757b5dde299513df
2020-05-01 12:05:15 -07:00
c0a985fcd6 Allow customizing retryable message types in Faulty agent tests (#37450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450

It doesn't seem like we could customize the retryable message types by
passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. Needed to fix this as part of adding timeout injection
support as mentioned in https://github.com/pytorch/pytorch/issues/36272
ghstack-source-id: 103287164

Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`

Differential Revision: D21270127

fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
2020-05-01 12:00:36 -07:00
1f09f7ea44 Python API for Complex Storage and storage copy logic (#35771)
Summary:
Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771

Differential Revision: D21319650

Pulled By: anjali411

fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
2020-05-01 11:47:22 -07:00
deb4100928 [DistributedSampler] Only create torch.generator and seed when shuffling (#37604)
Summary:
We don't need to create `torch.Generator()` and seed it if we are not shuffling.
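
A sketch of the guard (illustrative; not the actual sampler source):

```python
import torch

def epoch_indices(dataset_len, epoch, shuffle):
    if shuffle:
        g = torch.Generator()   # only create and seed when shuffling
        g.manual_seed(epoch)
        return torch.randperm(dataset_len, generator=g).tolist()
    return list(range(dataset_len))
```
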
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37604

Differential Revision: D21346167

Pulled By: rohan-varma

fbshipit-source-id: 6ed560d236bc5c026a7d321755ddc02a29db1604
2020-05-01 10:56:40 -07:00
6ecb5bb1f0 match old fuser rem to eager (#37196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37196

Reviewed By: zdevito

Differential Revision: D21223172

Pulled By: Krovatkin

fbshipit-source-id: 4d4ff1127d5dc69ab73f07ca79c1f5b0b4dd9732
2020-05-01 10:55:06 -07:00
ecf1ea75a7 Make c10::ComplexHalf a template specialization of c10::complex (#37426)
Summary:
This PR basically makes `c10::ComplexHalf` a template specialization of `c10::complex`. Since `c10::ComplexHalf` is not used much, this does not include much change.

Due to the fact that `c10::Half` does not have many `constexpr` methods, it is impossible to keep the same API. Currently, we are just completely reusing the old implementation; only the name changes from `c10::ComplexHalf` to `c10::complex<c10::Half>`. We can always change the implementation in the future when needed. But for now, I think this is OK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37426

Differential Revision: D21300754

Pulled By: anjali411

fbshipit-source-id: fc0f65adccf97025a727735096780ce8078675a1
2020-05-01 10:49:24 -07:00
22708be5af Migrate tan from TH to ATen (CUDA) (#36906)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24641

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tan(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tan(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28325206200003095
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.28363607099998944
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.43924326799998425
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3754699589999859
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.38143782899999223
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7672172019999834
```

After:

```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28982524599996395
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.29121579000002384
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.4599610559998837
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3557764019997194
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.34793807599999127
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7564662459999454
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36906

Differential Revision: D21335320

Pulled By: VitalyFedyunin

fbshipit-source-id: efab9c175c60fb09223105380d48b93a81994fb0
2020-05-01 10:17:19 -07:00
df31ddbd98 Add channel shuffle op fp32 + quantized. (#36815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36815

PyTorch does not have a native channel shuffle op.
This diff adds one for both fp32 and quantized tensors.
The fp32 implementation is an inefficient reference one; for quantized tensors
there is a native QNNPACK op.
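A quick usage sketch, assuming the fp32 op is exposed as `torch.channel_shuffle`:

```python
import torch

x = torch.arange(1 * 4 * 2 * 2, dtype=torch.float).reshape(1, 4, 2, 2)
y = torch.channel_shuffle(x, groups=2)
# With 2 groups, channels [0, 1, 2, 3] are regrouped as [0, 2, 1, 3].
print(y[0, :, 0, 0])  # tensor([ 0.,  8.,  4., 12.])
```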
ghstack-source-id: 103267234

Test Plan:
buck run caffe2/test:quantization --
quantization.test_quantized.TestQuantizedOps.test_channel_shuffle
The x86 implementation in QNNPACK is SSE2, so this may not be the most
efficient on x86.

Reviewed By: dreiss

Differential Revision: D21093841

fbshipit-source-id: 5282945f352df43fdffaa8544fe34dba99a5b97e
2020-05-01 10:07:15 -07:00
1510bdd42d Replace empty_affine_quantizer with new_qtensor_cpu. (#36814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36814

ghstack-source-id: 103218412

From the flamegraph, it seems we spend 40% of the time going through the dispatch stack. In quantized models, where compute can take less time, such overheads become noticeable.

{F234432545}

Test Plan: Quantized op tests.

Reviewed By: jerryzh168

Differential Revision: D21093840

fbshipit-source-id: 1b98b57eae403353596fc31171069d2f43b13385
2020-05-01 10:07:10 -07:00
6de949afaf Add quantized adaptive avgpool. (#36813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36813

- Change q_avgpool to map special cases of adaptive avgpool to avgpool (see the sketch below).
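For reference, the special case being mapped (shown with plain eager ops): when the input size is an integer multiple of the output size, adaptive average pooling is exactly average pooling with kernel = stride = input_size // output_size.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
a = F.adaptive_avg_pool2d(x, (4, 4))
b = F.avg_pool2d(x, kernel_size=2, stride=2)  # 8 // 4 == 2
assert torch.allclose(a, b)
```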
ghstack-source-id: 103218410

Test Plan: QuantizedOps.test_adaptive_avgpool2d

Reviewed By: z-a-f

Differential Revision: D21093837

fbshipit-source-id: c45a03b597eaa59e1057561ee4e8e116ac138f8f
2020-05-01 10:07:05 -07:00
f6c82e04a0 Move to using MemoryFormat::ChannelsLast for avgpool2d. (#36812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36812

ghstack-source-id: 103218413

Test Plan: Quantized op tests.

Reviewed By: z-a-f

Differential Revision: D21093839

fbshipit-source-id: 9b68916e56684efb80dd131eece655a7f3779362
2020-05-01 10:04:49 -07:00
e852b45d9f Overload c10::complex operators inside c10 namespace (#37605)
Summary:
See:
https://github.com/pytorch/pytorch/issues/37563#issuecomment-622062118
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Ro-namespace

Tested by verifying that the eq and ne operators no longer fail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37605

Differential Revision: D21349289

Pulled By: anjali411

fbshipit-source-id: d12c89c207e36ebd4c88aa4d06425dd98d58883b
2020-05-01 08:42:35 -07:00
4ed790d742 Adding symbolic sizes, contiguity, stride indices (#36101)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36101

Reviewed By: jamesr66a

Differential Revision: D20908711

Pulled By: Krovatkin

fbshipit-source-id: f90ce74acffeb645d7d906d07e293164d65ed7e6
2020-05-01 02:01:25 -07:00
9e32a1f5cd [wip] update graph fuser aliasdb in-place (#37106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37106

Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
```
x, y = f(a, b, c)
```

and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
   x_in, y_in = f(a_in, b_in, c_in)
   -> x_in, y_in
```

We destroy the `x` and `y` `Value*`s in the process. This operation is
easy to express as an update to the aliasDb--`x_out` just takes on all
the aliasing information `x` used to have. In particular, since we know
`f` and `prim::fusionGroup` are purely functional, we don't have to mess
with any write information.

This PR is the bare minimum to get this working, in the interest of
unscrewing the compilation times ASAP.

Followups I want to do:
- We don't have a way of expressing deletion of values in AliasDb. In
`graph_fuser.cpp` we sometimes construct nodes that we end up throwing
away, and we are littering `MemoryDAG` with references to dangling
pointers. Because of the way the pass works, it's fine, but this is
fragile so I want to fix it.
- We should decouple alias analysis from write tracking, to simplify the
job of keeping the write caches consistent as we mutate the aliasing
information.
- The tensorexpr fuser doesn't do this and is thus incorrect today; we
need to update it to work the same way.

Test Plan: Imported from OSS

Differential Revision: D21219179

Pulled By: suo

fbshipit-source-id: 8ae5397b3a0ad90edec2fbc555647091f1ad5284
2020-04-30 22:21:35 -07:00
0692804747 add slope == 0 case into standard leaky relu nn test (#37559)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37559

Test Plan: Imported from OSS

Differential Revision: D21319922

Pulled By: glaringlee

fbshipit-source-id: 212ef8e9d0f0d55a312d282693cd5990e0376c6a
2020-04-30 20:56:11 -07:00
91e74fd843 [JIT] Adds a code_with_constants method to module printing (#37586)
Summary:
Closes https://github.com/pytorch/pytorch/issues/36625
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37586

Differential Revision: D21331385

Pulled By: suo

fbshipit-source-id: 752e63eac8bdd06c6719efb972cdc832ad7c1535
2020-04-30 20:44:01 -07:00
7c4bda7e6f Eliminate warnings for cpp extensions on Windows (#37400)
Summary:
Improve the readability of the logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37400

Differential Revision: D21302597

Pulled By: ezyang

fbshipit-source-id: b8cbd33f95b6839ad4c6930bed8750c9b5a2ef7a
2020-04-30 20:28:03 -07:00
5ab36ec98b Move cauchy_() to DistributionTemplates (#37602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37602

Fixes #37371

Test Plan: Imported from OSS

Differential Revision: D21334739

Pulled By: pbelevich

fbshipit-source-id: b8443c14760ec825da3f7d300ad496578170671f
2020-04-30 20:03:02 -07:00
bedc50ed07 Ensure we are diffing against the right thing in clang-format (#37589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37589

Apply the fix from https://github.com/pytorch/pytorch/commit/9f890a92 to
the clang-format job as well.

Test Plan: Imported from OSS

Differential Revision: D21330250

Pulled By: suo

fbshipit-source-id: d92b9666ba8b92d049393cbe7f2ce45daa563910
2020-04-30 19:04:33 -07:00
a09cb5f2f5 [quant] quantized reflection_pad1d (#37452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37452

Test Plan: Imported from OSS

Differential Revision: D21286659

Pulled By: z-a-f

fbshipit-source-id: f9f4de497a790b296149313562d09f8ead5facee
2020-04-30 18:45:38 -07:00
20f5d4436e Updating submodules
Summary:
GitHub commits:

a253b9719d
6504ae0c4e
e6a4c5a552

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 267d7c60012547549ce2750c4f6c5bd16024fdfa
2020-04-30 18:40:52 -07:00
e841bea465 [quant] QNNPACK Add deconvolution parameters (#36716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36716

Test Plan: Imported from OSS

Differential Revision: D21110112

Pulled By: z-a-f

fbshipit-source-id: 4b62e1bb3c3b6a3276bc5f8ee5ead0f513ec0137
2020-04-30 18:32:47 -07:00
5efd10518f [jit] speed up alias analysis (#36345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36345

During compilation, we spend a huge amount of time in alias analyis.
This PR does a few things to speed it up.

1. Separate the analysis into two phases: one where we build up the
necessary data structures, and the other where we service aliasing
queries. This allows us to defer building indices/maintaining index
consistency until after the "buildup" phase is done.

2. Properly memoize the memory-location lookups (dynamic programming).

3. Done naively, setting wildcards invalidates the above memoization,
triggering costly recomputation. So I added a cache-aware `setWildcards`.
Sadly that means alias analysis has to reach into the guts of
`MemoryDAG`, but the speedup is worth it.
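A toy sketch of the deferred-index-plus-memoization idea (hypothetical Python, not the actual C++ AliasDb):

```python
class AliasIndex:
    def __init__(self):
        self._edges = {}   # value -> set of values it may alias
        self._memo = None  # lazily built closure index

    def add_alias(self, a, b):
        # "Buildup" phase: cheap edge inserts, no index maintenance.
        self._edges.setdefault(a, set()).add(b)
        self._edges.setdefault(b, set()).add(a)
        self._memo = None  # mutation invalidates the cached index

    def may_alias(self, a, b):
        # "Query" phase: build the index once, then answer from the memo.
        if self._memo is None:
            self._memo = {v: self._closure(v) for v in self._edges}
        return a == b or b in self._memo.get(a, set())

    def _closure(self, root):
        seen, stack = {root}, [root]
        while stack:
            for n in self._edges.get(stack.pop(), ()):
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        return seen

db = AliasIndex()
db.add_alias("x", "y")
db.add_alias("y", "z")
print(db.may_alias("x", "z"))  # True, computed once and then memoized
```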

Sadly, these changes are kind of coupled for correctness reasons, so
they're all here at once.

I used this model (thanks IlyaOvodov) as a provisional benchmark. You
can get it here:
https://www.dropbox.com/s/jlyygn6yygj1jkx/yolov3.zip. Unzip it and run
`python test_timing.py`.

Baseline: (752.076s) right before 6bc8ffe82462c77ac4f9b27452046cb1f8f07d92
After optimizing before inlining: (699.593s)
After deferring cache construction: (426.180s)
After cache-aware `setWildcards`: (193.678s)

So a nice 75% speedup to overall compilation. There's a lot more to do
in other places of the compilation pipeline though.

Followup to this PR specifically:  Everything that fans out from the
`analyze` call is the "buildup" phase of AliasDB construction. This
should be factored into a separate analysis pass to statically
distinguish the two phases (right now we just null out stuff to
accomplish the same thing dynamically).

Test Plan: Imported from OSS

Differential Revision: D20952727

Pulled By: suo

fbshipit-source-id: 099f797222d7e71e5c04991584adc2c7eab5a70f
2020-04-30 18:27:41 -07:00
e98ad6c05b [RELAND] Remove patches that circumvent MAGMA bug (#35973)
Summary:
Changelog:
- The MAGMA implementation of the LU factorization for batches of small singular square matrices had a bug that resulted in NaN values in the result. This has been fixed in MAGMA 2.5.2. This PR removes the existing patch that was a temporary workaround for this bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35973

Test Plan:
- Existing tests for det and lu should pass

This is a re-submit of https://github.com/pytorch/pytorch/issues/34357

Differential Revision: D21336552

Pulled By: seemethere

fbshipit-source-id: 9c3b350966913147f1d5811927f3cae10fe620f1
2020-04-30 16:28:36 -07:00
cd4c3b48a6 Add LN after specialized output embeddings and flexible LCE (#35178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35178

* add layer norm (LN) after specialized output embeddings
* add flexible lce inside specialized module

Test Plan:
* unit-tests
  * buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 --
  * buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_6 --

* workflows
  * flexible lce: f177025325

{F232112501}

  *  LN: f177025301

{F232112982}

Differential Revision: D20586281

fbshipit-source-id: 664e77cb4cb5bec6646cafd2e4afb88aff27df03
2020-04-30 15:32:09 -07:00
6f8838cd2f Revert D21326386: [pytorch][PR] [Reland] Implement cusparse Descriptor class and clean up cusparse code
Test Plan: revert-hammer

Differential Revision:
D21326386

Original commit changeset: f34875865c8b

fbshipit-source-id: 7b173ddc9c6f9d8d496e2bf3cd80bc9b85bda50a
2020-04-30 15:16:36 -07:00
1aedc2c5b9 Skip c2 ref onnx model tests (#37591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37591

skip the tests since gluster is gone.

Test Plan: ci

Reviewed By: ezyang

Differential Revision: D21330359

fbshipit-source-id: a4e158fb72eddb08ba49fcfa9541569a150f8481
2020-04-30 14:32:47 -07:00
cd48fb5030 Vectorize linspace on CPU. (#27957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27957

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R) Xeon(R) E-2136):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.linspace(0, 10, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.linspace(0, 10, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.3964195849839598
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
1.2374563289922662
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.8631796519621275
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
1.6991038109990768
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.8358083459897898
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7214750979910605
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.8356257299892604
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.706238206999842
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.7463878280250356
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.6172360889613628
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.8656846070080064
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
1.714238062966615
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
1.8272205490502529
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
1.6409171230043285
```

After:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.0077099470072426
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.8227124120458029
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.0058343949494883
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.8376779520185664
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.903041019977536
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7576498500420712
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.7628699769848026
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.6204477970022708
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
2.0970272019621916
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.9493417189805768
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
2.29020385700278
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
2.1212510910118
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.3479344319785014
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
2.156775983981788
```

Test Plan: Imported from OSS

Differential Revision: D20773454

Pulled By: VitalyFedyunin

fbshipit-source-id: ebeef59a90edde581669cc2afcc3d65929c8ac79
2020-04-30 14:26:24 -07:00
2c33ea1c47 [doc] improve tensor.view doc (#36728)
Summary:
Fix an inaccurate formula and advertise `reshape` better.
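For context, the behavior worth advertising (standard semantics, nothing new in this PR): `view` requires compatible strides, while `reshape` falls back to a copy when a view is impossible.

```python
import torch

x = torch.randn(2, 3).t()  # transposed -> non-contiguous
try:
    x.view(6)              # raises: the result cannot be expressed as a view
except RuntimeError as e:
    print("view failed:", e)
y = x.reshape(6)           # succeeds by copying
```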
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36728

Differential Revision: D21094139

Pulled By: zou3519

fbshipit-source-id: 966ce5b84938b384584489040d2f132fee295bb4
2020-04-30 12:17:03 -07:00
3e1859959a Updating submodules
Summary:
GitHub commits:

2b59db7359
242186f5ff
4bf6682fe6
fe238e5438
e8cf50093e
17d9a609f2

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 3c1030ebc3768a827583b50c2d47fba494816943
2020-04-30 11:54:53 -07:00
13013848d5 Fix cpp_ext build dir create permission (#34239)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34239

Differential Revision: D21328036

Pulled By: soumith

fbshipit-source-id: dac2735383b1a689139af5a23f61ccbebd1fd6c1
2020-04-30 11:30:07 -07:00
287f3b746e Remove Backend -> THPLayout mapping. (#37527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37527

This is yet another place that needs to be updated for adding a new "Backend" and is unnecessary.  Instead, just use layout_from_backend and have a map from Layout -> THPLayout.

Other changes:
- rename torch::getDtype and torch::getLayout to torch::getTHPDtype and torch::getTHPLayout since e.g. for layout you are both passing in and returning a "layout" type.
- add NumOptions to Layout to match the dtype/ScalarType formulation.

Test Plan: Imported from OSS

Differential Revision: D21309836

Pulled By: gchanan

fbshipit-source-id: ede0e4f3bf7ff2cd04a9b17df020f0d4fd654ba3
2020-04-30 11:11:09 -07:00
8a30553738 [TensorPipe/RPC] Add TensorPipe dependency (#36695)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36695

Reviewed By: lw

Differential Revision: D21312297

Pulled By: beauby

fbshipit-source-id: 39fdc3de91efa4ac97dd169f09fb304b273b0050
2020-04-30 11:05:15 -07:00
b97341e3dd [c2][opt] nomnigraph transform for ClipRangesGatherSigridHash fusion (#37535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37535

Fuse ClipRanges + GatherRanges + SigridHash -> ClipRangesGatherSigridHash

The dpa_product_ctr model's dper2-to-dper3 migration is blocked by 3.6% higher prospector CPU usage. The root cause was traced to sigrid transforms, where ClipRanges, GatherRanges, and SigridHash are called separately instead of fused, as they are in dper2.

Further context:
https://fb.quip.com/GijaAZtX5mav
https://fb.quip.com/pIDdAjJP2uiG

Test Plan:
Local benchmarking with small model 181513584_0
(Dper3 full model is 178772812, dper2 refresh is 178770392)

Transform turned on: P129799373
Iters per second: 609.291

Transform turned off: P129799397
Iters per second: 519.088

We also want to confirm this performance on the full model in canary and in qrt.

`buck build mode/opt-clang mode/no-gpu caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench`

`MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --pred_net=/data/users/ansha/tmp/dpa/small_pred_net.pb --c2_model=/data/users/ansha/tmp/dpa/181513584_0.predictor --c2_inputs=/data/users/ansha/tmp/dpa/c2_inputs_small.pb --iters=3000 --warmup_iters=100 --num_threads=32 --c2_apply_nomnigraph_passes=1 --caffe2_predictor_enable_preproc_fusion=1`

Prospector canary:
https://our.intern.facebook.com/intern/ads/canary/426280288521552095/
Check that ClipRangesGatherSigridHash is used: https://fburl.com/scuba/caffe2_operator_stats_canary/e6qfdsat

Reviewed By: yinghai

Differential Revision: D21262085

fbshipit-source-id: 2c2481e3d4977abb8abe6e9ef0c9999382320ab2
2020-04-30 11:03:47 -07:00
20ba29d81c Add support for reductions on CPU in tensorexpr (#37333)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37333

Differential Revision: D21290289

Pulled By: resistor

fbshipit-source-id: ebba11f7af9e22b48c47e2eefb9497fa77acd17d
2020-04-30 10:59:38 -07:00
d3d10cc14a Add tests for lower_graph and fix unpack() ops dispatch (#37540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37540

ghstack-source-id: 103169129

Test Plan:
buck test mode/no-gpu mode/dev //caffe2/test:jit -- 'test_lower_graph_conv \(test_jit\.TestScript\)'
buck test mode/no-gpu mode/dev //caffe2/test:jit -- 'test_lower_graph \(test_jit\.TestScript\)'

Differential Revision: D21313433

fbshipit-source-id: bb9942272784e517b07537ee4c149b9dc4df4c2a
2020-04-30 10:55:05 -07:00
149b468ce2 [TensorBoard] Fixes missing doc for add_graph (#37504)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37504

Differential Revision: D21305979

Pulled By: natalialunova

fbshipit-source-id: 0606a39a6006a236a37a0e6df87959736474547f
2020-04-30 10:26:19 -07:00
c5624e831d Add overloads of std:: math functions for c10::complex [resubmit] (#37468)
Summary:
This reverts commit d167a7f6542ca751de0d5bd76653a587f97906f8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37468

Differential Revision: D21305110

Pulled By: anjali411

fbshipit-source-id: d1bdc9d9feac00331fc2b2b905d49f80bef680f9
2020-04-30 10:20:45 -07:00
e9db16e0c1 [Reland] Implement cusparse Descriptor class and clean up cusparse code (#37533)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/37389. Fix for the cuda 10.1 CI failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37533

Differential Revision: D21326386

Pulled By: ezyang

fbshipit-source-id: f34875865c8bad76163995c18d88b0e76656bb22
2020-04-30 10:12:44 -07:00
5bb01568c3 speed up and re-enable quantized bn unit tests (#37420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37420

The quantized BN unit tests were disabled because they took too long.
This diff removes hypothesis from these test cases and instead generates
the cases manually.  The run time is ~7 seconds per test on my devgpu.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm2d_relu
python test/test_quantization.py TestQuantizedOps.test_batch_norm3d
```

Imported from OSS

Differential Revision: D21310333

fbshipit-source-id: 2499f7a3d6a87c0278d012ae65132f148cee6d2e
2020-04-30 09:44:44 -07:00
f09eb391b9 Move masked_select broadcasting from codegen layer to native layer. (#37543)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37543

Test Plan: Imported from OSS

Differential Revision: D21315038

Pulled By: gchanan

fbshipit-source-id: 66e255a9a696db189154605c84dca2a0f3b9ee5c
2020-04-30 09:12:09 -07:00
69e2f1aaff [cmake] add HAVE_SOVERSION option (default=OFF). (#37502)
Summary:
This is useful for linux distributions when the ABI/API of libtorch has
been changed. The default SOVERSION is set to
"${TORCH_VERSION_MAJOR}.${TORCH_VERSION_MINOR}".

ezyang

But if the release strategy of pytorch/caffe2 involves avoiding breaking API/ABI changes to libtorch for minor/patch releases, then we can set `TORCH_SOVERSION` to simply `TORCH_VERSION_MAJOR`. Please confirm that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37502

Differential Revision: D21303565

Pulled By: ezyang

fbshipit-source-id: 798f5ec7fc5f0431ff1a7f9e8e5d3a0d3b25bb22
2020-04-30 06:52:33 -07:00
4c8636c74c Unify the path for environment restore script (#37486)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37486

Differential Revision: D21326213

Pulled By: ezyang

fbshipit-source-id: 6f0fd07da51439e999026593951396bfc26a2abf
2020-04-30 06:42:27 -07:00
6792bafa72 [pytorch] aten codegen to filter backends for default mobile build
Summary:
This is a simple change to mitigate the OSS mobile default build size regression caused by #34275 and #34622.

Mobile supported backends are already kinda hard-coded in function_wrapper.py as `static_dispatch_backends`:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/function_wrapper.py#L243

This is simply to align dynamic registration with static dispatch for mobile build.

To measure mobile build size:
```
// Default mobile build:
scripts/build_pytorch_android.sh armeabi-v7a

// MobileNetV2 custom build:
SELECTED_OP_LIST=MobileNetV2.yaml scripts/build_pytorch_android.sh armeabi-v7a
```

- arm-v7 Android AAR (compressed) size:
```
+----------+-------------------+---------------+
|          | MobileNetV2 Build | Default Build |
+----------+-------------------+---------------+
| Original |         3,354,589 |     5,731,992 |
| #34275   |         3,404,978 |     6,640,526 |
| #34622   |         3,432,569 |     6,640,526 |
| This PR  |         3,431,660 |     6,534,135 |
+----------+-------------------+---------------+
```

Differential Revision: D20415107

Test Plan: Imported from OSS

Pulled By: ljk53

fbshipit-source-id: 75acf4dc5dfe9242c01b2db0b84bd6b4a1d0cd8d
2020-04-30 01:35:38 -07:00
68250fa557 Vanilla Pytorch bionic clang9 test in CI (#36711)
Summary:
fixes https://github.com/pytorch/pytorch/issues/36676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36711

Differential Revision: D21148986

Pulled By: ailzhang

fbshipit-source-id: e383bd50c893b5eee850eab65aa4681a6b491706
2020-04-29 23:07:46 -07:00
9d0891f886 [pytorch][buck] tweak code analyzer e2e script
Summary:
- Add debug mode to include debug information.
- Move codegen comment to FB shell script (as it's only checked-in FB repo).
- Analyze lite-predictor instead of full-JIT as full-JIT BUCK target contains variable kernels thus pull in a lot more dependencies.
- Use pre-opt bitcode instead of pre-codegen bitcode - there is one special `callOp()` case in RNN.cpp where optimized bitcode has opname string and API body inlined together: https://fburl.com/diffusion/8rz6u4rg; pre-optimization bitcode should give more stable result.

Test Plan: - Tested the bash script with stacked diff.

Reviewed By: iseeyuan

Differential Revision: D21298837

fbshipit-source-id: be33e2db5d8cb0f804460c503e52beb0dcb4857f
2020-04-29 22:38:09 -07:00
ac5403f22e [quant] Check qengine for TestNormalization (#37562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37562

The model has a LinearLayer which needs fbgemm. Fixes failing windows test.

Test Plan:
python test/test_quantization.py TestPostTrainingStatic

Imported from OSS

Differential Revision: D21321032

fbshipit-source-id: 1671fdef5d0a1b43e2a4e703a8852d522af32288
2020-04-29 22:16:43 -07:00
091a1192d7 [JIT] Convert float Tensor argument to double in prim::tolist (#37465)
Summary:
**Summary**
Converting a float `Tensor` to a Python list is not supported because
Python's float is actually a double. This commit modifies the
implementation of `prim::tolist` so that it converts an input argument
that is a float Tensor into a double Tensor and emits a warning.
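A sketch of the behavior after this change (the annotation style follows the existing jit tests; the warning text is paraphrased below):

```python
import torch
from typing import List

@torch.jit.script
def to_list(x: torch.Tensor) -> List[float]:
    # A float32 input is now converted to double (with a warning), since a
    # Python float is a 64-bit double.
    return torch.jit.annotate(List[float], x.tolist())

print(to_list(torch.randn(3, dtype=torch.float)))
```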

**Test Plan**
Modified and ran the corresponding unit test.

*Before*
```
======================================================================
ERROR: test_to_list (jit.test_list_dict.TestList)
Unit tests for Tensor.tolist() function.
----------------------------------------------------------------------
...
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Output annotation element type and runtime tensor element type must match for tolist()
----------------------------------------------------------------------
Ran 1 test in 0.151s

FAILED (errors=1)
```

*After*
```
UserWarning: Converting float Tensor to double because tolist is only supported for double type Tensors (Triggered internally at  ../torch/csrc/jit/runtime/register_prim_ops_fulljit.cpp:626.)
  return callable(*args, **kwargs)
.
----------------------------------------------------------------------
Ran 1 test in 0.210s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37465

Differential Revision: D21311829

Pulled By: SplitInfinity

fbshipit-source-id: a0c1796013e35baf8d7641af271424a10e26f161
2020-04-29 21:17:19 -07:00
eb5590d6f4 Updating submodules
Summary:
GitHub commits:

19f74a96e6
ac50e17058
2d1a80916f
b9587ed249
98b76fab50
b938e6042b
e661b714ec
7dfd2114cb
a2138b57e5
bbd7e74dd9
b2bdf8486e

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 1472de3aa441e87859255ad3a73156c180078a1f
2020-04-29 20:19:03 -07:00
a0075c4825 [XNNPACK] Disable xnnpack ops for both iOS and macOS (#37528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37528

XNNPACK is currently running into threading issues with pthreadpool on iOS; AIBench has been failing since 4/23. The fix could take days to land, so disable XNNPACK for the time being.
ghstack-source-id: 103151011

Test Plan:
- ` buck run mode/mac aibench:run_bench_macos -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform ios --framework pytorch --remote --devices D201AP-12.0.1`

- `buck test PyTorchPlaygronud`

```
RESULTS FOR //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests //fbobjc/Apps/Internal/PyTorchPlayground:verify_no_third_party_license_PyTorchPlayground_test //tools/build_defs/apple/plugins:plugin_function_tests //xplat/configurations/buck/apple/plugins/tests:FBPluginSmokeTests
PASS     316ms  2 Passed   0 Skipped   0 Failed   PyTorchBITests
PASS      1.5s  1 Passed   0 Skipped   0 Failed   PyTorchFBNetTests
PASS    <100ms  1 Passed   0 Skipped   0 Failed   //fbobjc/Apps/Internal/PyTorchPlayground:verify_no_third_party_license_PyTorchPlayground_test
PASS      2.1s  1 Passed   0 Skipped   0 Failed   com.facebook.starlark.testing.plugin_function_tests
PASS    <100ms  1 Passed   0 Skipped   0 Failed   FBPluginCovariantSanityTests
PASS    <100ms  5 Passed   0 Skipped   0 Failed   FBPluginEnumSanityTests
PASS    <100ms  3 Passed   0 Skipped   0 Failed   FBPluginFunctionSanityTests
PASS    <100ms  8 Passed   0 Skipped   0 Failed   FBPluginListLookupSanityTests
PASS    <100ms  5 Passed   0 Skipped   0 Failed   FBPluginSortedBySanityTests
PASS    <100ms  5 Passed   0 Skipped   0 Failed   FBPluginSplitSchemaSanityTests
Updated test logs: buck-out/log/test.log
TESTS PASSED
```

Reviewed By: kimishpatel, iseeyuan

Differential Revision: D21309715

fbshipit-source-id: a09a7f20d6b9d995c8fb54fe44bd9b33884c78d1
2020-04-29 20:05:14 -07:00
482d1f4b8c [quant][graphmode] fix observer instance copy (#37185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37185

Previously, observer instances shared the same Tensor attributes. That is
fine as long as no in-place operations are performed on those attributes, but
it becomes a problem when they are mutated in place.
This PR uses deepcopy instead of clone_instance, which copies the tensors for each instance.
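A toy illustration of the difference (hypothetical `Obs` class, not the real observer):

```python
import copy
import torch

class Obs:
    def __init__(self):
        self.min_val = torch.tensor(float("inf"))

proto = Obs()
shared = copy.copy(proto)     # clone_instance-like: shares min_val's storage
owned = copy.deepcopy(proto)  # what this PR switches to

shared.min_val.fill_(0.0)     # in-place change through the shared attribute
print(proto.min_val)          # tensor(0.) -- leaked into the prototype
print(owned.min_val)          # tensor(inf) -- unaffected
```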

Test Plan:
.

Imported from OSS

Differential Revision: D21309084

fbshipit-source-id: afd974b0c97886fbab815e9c711c126379fe3e17
2020-04-29 19:51:29 -07:00
a961d3acf3 graph mode: add handling for layer_norm op (#37525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37525

Adds graph mode handling for the layer_norm op

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_layer_norm
```

Imported from OSS

Differential Revision: D21310360

fbshipit-source-id: 83f475fb30b89c29623b79a6a6bfd5c20c569b51
2020-04-29 19:43:59 -07:00
4e7403c286 graph mode: add hardswish op (#37524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37524

Adds graph mode handling for the hardswish op.

Test Plan:
```
python test/test_quantization.py TestQuantizeScriptPTSQOps.test_hardswish
```

Imported from OSS

Differential Revision: D21310365

fbshipit-source-id: 07e79943ed8095a5220be1582867b33597b85855
2020-04-29 19:43:53 -07:00
7ac98c9396 graph mode: refactor quantized hardswish API for easier graph handling (#37523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37523

Makes the quantized hardswish function API more suited to graph mode
handling, which will come in the next PR.

Test Plan:
CI

Imported from OSS

Differential Revision: D21310364

fbshipit-source-id: 0d438dce5b87481d558c07bcccd9fe717200b4dc
2020-04-29 19:43:48 -07:00
11b6f70f7d graph mode: add hardsigmoid op (#37522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37522

Adds hardsigmoid op to graph mode handling.

Test Plan:
CI

Imported from OSS

Differential Revision: D21310363

fbshipit-source-id: 4d9f3bb032fb5a4d8f0cf84bff230fc1ce222c3c
2020-04-29 19:43:43 -07:00
6cdc8cac47 graph mode: add elu op (#37521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37521

Adds ELU to graph mode handling.

Test Plan:
CI

Imported from OSS

Differential Revision: D21310361

fbshipit-source-id: 045fc3af796dea67e0153255648fe5911e70bbed
2020-04-29 19:43:38 -07:00
400098d492 graph mode: add hardtanh op (#37469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37469

Adds graph mode handling for hardtanh op.

Test Plan:
CI

Imported from OSS

Differential Revision: D21310362

fbshipit-source-id: 63e4ffca5cf2f345c2a66f84db2193a5e14b1028
2020-04-29 19:42:24 -07:00
b33b46a950 [quant] Enable qnnpack tests for test_quantize and test_numeric_suite (#37351)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37351

Test Plan:
python test/test_quantization.py PostTrainingStaticQuant

Imported from OSS

Differential Revision: D21293704

fbshipit-source-id: 621f3ac60315b61f99b9b41da691ac3473e974cc
2020-04-29 19:28:22 -07:00
b48239af3c Cleanup internal functions in python_functions.cpp (#37536)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37536

Test Plan: Imported from OSS

Differential Revision: D21312500

Pulled By: mrshenli

fbshipit-source-id: 52293323d2083d300712b1811cb6784419ea441c
2020-04-29 19:12:46 -07:00
322e564ee3 Minor format cleanup in py_rref.cpp (#37520)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37520

Test Plan: Imported from OSS

Reviewed By: xush6528

Differential Revision: D21308889

Pulled By: mrshenli

fbshipit-source-id: 36d5efc4d9c3e6cc0b2abec35675a338a2f81424
2020-04-29 19:12:40 -07:00
d5b38984c8 Let RPC return FutureIValue instead of FutureMessage (#37519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37519

closes #37446

Currently FutureMessage is used in several places:

1. `rpc_async` returns a `FutureMessage` object and we expose it
   as `torch.distributed.rpc.Future`. From applications perspective,
   they are expecting a `py::object` instead of a `Message`, and we
   do the conversion in the `Future.wait()` pybind method.
2. RPC autograd profiler takes `FutureMessage` and installs
   callbacks to it. The profiler actually only need a `Future<T>`
   and does not care what `T` is.
3. `OwnerRRef` exposes a `getFuture()` API which returns a
   `FutureMessage`. This `FutureMessage` will be marked completed
   when the value referenced by the `OwnerRRef` is ready.
   `OwnerRRef` does not need it to be a Message type either, it
   actually creates an empty `Message` to mark the `Future`.

All of the above places use `FutureMessage`, but none of them really
needs a `Message`; `Message` is a communication-layer type that
applications, the profiler, and the RRef shouldn't be aware of.

Another motivation for making this change is that for async RPC
UDF #36071, we are going to allow application to call
`markCompleted` in Python. If we still use `FutureMessage`, then
in the `markCompleted` pybind function, it needs to convert the
provided `py::object` into a specific message type, which is
leaking communication layer code to pybind functions. Even if
this is doable, we will have two entities (RPC agent and pybind
Python frontend) accessing the same request callback logic. This is too messy.

This commit replaces all surface `FutureMessage` with `FutureIValue`,
so that `FutureMessage` is no longer visible from Python land. Note
that this does not cause BC issues, as the Python Future type name
and its API stay intact. Internally, we still have `FutureMessage`
in the communication layer.
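A usage sketch from the application's perspective (assumes `rpc.init_rpc` has already been called and a peer named `"worker1"` exists):

```python
import torch
import torch.distributed.rpc as rpc

fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2), 1))
print(fut.wait())  # a Tensor, i.e. a Python value -- never a raw Message
```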

Test Plan: Imported from OSS

Reviewed By: xush6528

Differential Revision: D21308887

Pulled By: mrshenli

fbshipit-source-id: 4f574f38e83125081f142813cfdde56119522089
2020-04-29 19:10:29 -07:00
e9db51f9af Enable float requantization for avgpool/gavgpool ops. (#37037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37037

For avgpool and gavgpool, change the requantization scheme:
similar to conv and linear, convert the accumulated int32
values to float and apply a requantization scale that includes the averaging
multiplier, then convert the resulting float value back to int32 and add the
output zero point.
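A minimal numeric sketch of the new scheme (function and parameter names are assumed, not the QNNPACK API):

```python
def requantize_avg(acc, n, in_scale, in_zp, out_scale, out_zp):
    # acc is the int32 sum over a pooling window of n quantized elements.
    scale = in_scale / (out_scale * n)  # folds in the 1/n averaging multiplier
    val = (acc - n * in_zp) * scale     # int32 accumulator -> float
    q = round(val) + out_zp             # float -> int, add output zero point
    return max(0, min(255, q))          # clamp to quint8

# Average of four values whose quantized int32 sum is 500:
print(requantize_avg(500, 4, in_scale=0.1, in_zp=0, out_scale=0.1, out_zp=0))  # 125
```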

Benchmark numbers compared to baseline:
% speedup on Pixel XL:

+----------+---------+---------+
|          | aarch32 | aarch64 |
+----------+---------+---------+
| avgpool  |    0.4% |   13.6% |
| gavgpool |   -2.6% |    3.5% |
+----------+---------+---------+

Test Plan:
Tested via q8avgpool-test, q8gavgpool-test, average-pooling-test and
global-average-pooling-test in PT QNNPACK.
Also via integated test_quantized.py.
python test/quantization/test_quantized.py

Imported from OSS

Differential Revision: D21168981

fbshipit-source-id: 9060324304603ca7fd380c788a87b01a6d586c5c
2020-04-29 18:56:43 -07:00
d5363e6499 Set onnx opset version before model select (#37466)
Summary:
Set opset version before model select call - which is used to trigger warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37466

Reviewed By: hl475

Differential Revision: D21308796

Pulled By: houseroad

fbshipit-source-id: 0974b9d5b6562d4451f54053138174f663a17aa3
2020-04-29 17:37:09 -07:00
1ef992639d Make c10::complex the C++ type for complex tensors (#37421)
Summary:
# Overview

This PR changes the backing type of complex tensors in `ScalarType` from `std::complex` to `c10::complex`.

Since `c10::complex` and `std::complex` are reinterpret-castable, we can freely use `std::complex *` to access `c10::complex` data and vice versa. The implementation of `c10::complex` is not complete yet, so we are reinterpret casting all complex data to `std::complex` during dispatch, and do all operations in `std::complex`.

# `std::complex` and `c10::complex` interoperability

To use `std::complex *` to access  `c10::complex` data, the following specializations are added:
```C++
template <> inline std::complex<float>* Tensor::data_ptr();
template <> inline std::complex<double>* Tensor::data_ptr();
template <> inline std::complex<float> Tensor::item();
template <> inline std::complex<double> Tensor::item();
```

See [`aten/src/ATen/templates/TensorMethods.h`](https://github.com/pytorch/pytorch/pull/37274/files#diff-0e8bf6f5024b32c240a4c1f0b4d8fd71)

And

```C++
template <> inline std::complex<float> Scalar::to();
template <> inline std::complex<double> Scalar::to();
```

is added in [`c10/core/Scalar.h`](https://github.com/pytorch/pytorch/pull/37274/files#diff-aabe1c134055c8dcefad830c1c7ae957)

# Dispatch

Macros in [`Dispatch.h`](https://github.com/pytorch/pytorch/pull/37274/files#diff-737cfdab7707be924da409a98d46cb98) still using `std::complex` as its type. We will add macros such as `AT_DISPATCH_ALL_TYPES_AND_C10_COMPLEX_AND3` as needed during the migration and not in this PR.

Note that `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3` is only used in the CUDA copy kernel, and this PR already changes it to use `c10::complex`, because the CUDA copy kernel has to use its original dtype; otherwise there is funny dtype casting that causes a CUDA unspecified launch failure.

When all the migration is done, the c10 version of macros will be removed, and the default version will have `std::complex` replaced by `c10::complex` by default. This design allows us to incrementally migrate from `std::complex` to `c10::complex`.

# Note

Note that the `std::complex` is not completely replaced by `c10::complex` in c10 yet, for example `c10::Scalar` is still using `std::complex`. This will be fixed in later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37421

Differential Revision: D21282161

Pulled By: anjali411

fbshipit-source-id: 635e309e8c8a807c2217723ad250b5ab5a20ce45
2020-04-29 16:42:49 -07:00
5bb9357345 Update assertion in MHA forward to support FP16 training (#37539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37539

Bug fix

Test Plan:
This passed the fbtranslate local integration test when I toggled fp16 to true on GPU.

Also it passed in with D21312488

Reviewed By: zhangguanheng66

Differential Revision: D21311505

fbshipit-source-id: 7ebd7375ef2c1b2ba4ac6fe7be5e7be1a490a319
2020-04-29 16:29:23 -07:00
896f8130a6 Revert D21297549: [jit] fix trace checking reporting divergent names
Test Plan: revert-hammer

Differential Revision:
D21297549

Original commit changeset: 981d5879a4a2

fbshipit-source-id: 9be6e88007c644914973a305f9e7a961ef11a815
2020-04-29 16:16:44 -07:00
f1cd0eeb70 IValue(bool) constructor should initialize entire payload (#37513)
Summary:
Closes https://github.com/pytorch/pytorch/issues/37117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37513

Differential Revision: D21310444

Pulled By: malfet

fbshipit-source-id: 6ecb3504c13688d42daed4c847c919171d368830
2020-04-29 15:59:29 -07:00
7e9cc4df85 Migrate cos and cos_ from TH to ATen (CUDA) (#36653)
Summary:
Benchmark with same build settings on same system.

Closes https://github.com/pytorch/pytorch/issues/24545
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cos(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cos(a); torch.cuda.synchronize()',
                             setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                             number=t))
```

Before:

```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.2797315450006863
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.283109110998339
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.3648525129974587
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.34239949499897193
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.33680364199972246
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.0512770260102116
```

After:

```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.285825898999974
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.2781305120001889
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.34188826099989456
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.29040409300023384
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.28678944200009937
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.065477349000048
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36653

Differential Revision: D21164675

Pulled By: VitalyFedyunin

fbshipit-source-id: 5dd5d3af47c2a5527e1f4ab7669c2ed9a2293cee
2020-04-29 15:52:24 -07:00
6098cf7e33 Add sched_setaffinity check from libgomp to valgrind.sup (#37532)
Summary:
- It's valid to call `sched_setaffinity` with nullptr
- The call comes from libgomp, which should be valgrind-safe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37532

Test Plan: CI

Differential Revision: D21311252

Pulled By: malfet

fbshipit-source-id: a325f97741b997738c35759d02fcc34c1cb44d95
2020-04-29 14:48:23 -07:00
bca82801e7 add support for generating Vandermonde matrices (#36725)
Summary:
Adds support for generating Vandermonde matrices based off of the Numpy implementation found [here](https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/twodim_base.py#L475-L563).

Adds tests to ensure the generated matrix matches the NumPy implementation. Note: tests are limited to torch.long and torch.double due to differences in how PyTorch and NumPy handle type promotion.
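A quick usage sketch mirroring `np.vander` (assuming the op is exposed as `torch.vander`):

```python
import torch

x = torch.tensor([1, 2, 3, 5])
print(torch.vander(x, N=3))  # decreasing powers by default, like NumPy
# tensor([[ 1,  1,  1],
#         [ 4,  2,  1],
#         [ 9,  3,  1],
#         [25,  5,  1]])
```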
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36725

Differential Revision: D21075138

Pulled By: jessebrizzi

fbshipit-source-id: 6bb1559e8247945714469b0e2b07c6f4d5fd1fd0
2020-04-29 13:16:26 -07:00
f7dce8508c Revert D21302691: [pytorch][PR] Implement cusparse Descriptor class and clean up cusparse code
Test Plan: revert-hammer

Differential Revision:
D21302691

Original commit changeset: ecbb4063466c

fbshipit-source-id: 56ae47273691a12cc8d96635fb4ad9d09080ccc9
2020-04-29 12:57:02 -07:00
297cc5512e [quant] Enable convolution tests (#37494)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37494

Test Plan: Imported from OSS

Differential Revision: D21299442

Pulled By: z-a-f

fbshipit-source-id: 68513b52aaef852278f28031866f85123b016486
2020-04-29 12:24:45 -07:00
ec5fb29b96 Add overload names to dict operators. (#37279)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37279

Test Plan: Imported from OSS

Differential Revision: D21243579

Pulled By: iseeyuan

fbshipit-source-id: 2006bc15bdebd325ece037150065fe5b25f0cbc1
2020-04-29 12:10:28 -07:00
9e97e9244f Fix mobile type resolution in unpickling (#37425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37425

The mobile type resolver that we inject into the unpickler currently
creates a dummy type for everything, even built-in types like
List[int]. This PR restricts that behavior to types that start with
`__torch__`, and uses the mobile type parser for everything else.

I don't like this solution because it relies on a fragile invariant that
all "class-like" types have qualified names that start with `__torch__`.
I think the long term solution is to just re-use the script type parser
here.

Test Plan: Imported from OSS

Differential Revision: D21291331

Pulled By: suo

fbshipit-source-id: c94709bcbd1bac75336e033fd9d3afa6656b0a77
2020-04-29 12:03:46 -07:00
a3ab560f6c Port xnnpack operators to new registration API (#36800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36800

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21089650

Pulled By: ezyang

fbshipit-source-id: 1babdb5524038e3951d3c4303e4ba87e68b4f138
2020-04-29 11:29:23 -07:00
867e05921f Fix multiple issues with type annotations (#36358)
Summary:
- added tests that showcase the problems
- fixed the problems

These changes would allow me to remove many "# type: ignore" comments in my codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36358

Differential Revision: D21230704

Pulled By: ezyang

fbshipit-source-id: e6d475a0aa1fb40258fa0231ade28c38108355fb
2020-04-29 11:16:39 -07:00
bbf29a5239 Implement cusparse Descriptor class and clean up cusparse code (#37389)
Summary:
Add cusparse Descriptor class. Add cusparse Generic API wrapper. Clean up current cuda sparse code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37389

Differential Revision: D21302691

Pulled By: ezyang

fbshipit-source-id: ecbb4063466c616eebfe681f1622724692be505c
2020-04-29 11:08:07 -07:00
1bb66a0cd4 Extend some of the basic ops to kHalf (#37121)
Summary:
Added enough operators to make sure that all unit tests from ATen/basic are passing, except for MM and IntArrayRefExpansion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37121

Test Plan: `./bin/basic --gtest_filter=BasicTest.BasicTestHalfCPU` + `python -c "import torch; x = torch.tensor([2], dtype=torch.half); print(torch.isfinite(x+x))"`

Differential Revision: D21296863

Pulled By: malfet

fbshipit-source-id: e03d7a6939df11f611a9b317543bac52403cd009
2020-04-29 10:49:16 -07:00
bbd2350c99 Disable tests failing on test2 in ROCm CI (#37427)
Summary:
This pull request disables the unit tests that were observed to fail once `test2` was enabled. These tests will be examined and fixed one by one, but until then they are disabled to unblock `test2`.
The pull request also disables fftPlanDestroy for rocFFT to avoid double-freeing FFT handles

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37427

Differential Revision: D21302909

Pulled By: ezyang

fbshipit-source-id: ecadda3778e65b7f4f97e24b932b96b9ce928616
2020-04-29 09:56:28 -07:00
58a46a174e [cmake] add USE_SYSTEM_{XNNPACK,ONNX} options. (#37501)
Summary:
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37501

Differential Revision: D21303527

Pulled By: ezyang

fbshipit-source-id: 58353d78c66e5bcc9198ce8cde36ac7232bb4b2f
2020-04-29 09:26:16 -07:00
0d9e3b48c4 Remove THCudaMemGetInfo. Use c10's cacheInfo instead. (#37447)
Summary:
`THCudaMemGetInfo` is only used in `aten/src/ATen/native/cudnn/Conv.cpp`. We can call `c10::cuda::CUDACachingAllocator::cacheInfo` there directly and drop the parts of `THCudaMemGetInfo` that are not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37447

Differential Revision: D21302770

Pulled By: ezyang

fbshipit-source-id: 41ad68b8fd5ecc7bc666a6861789c6c1f743f420
2020-04-29 09:20:26 -07:00
68895eda9d add fmt, take 7 (#37356)
Summary:
fmt is a formatting library for C++. It has several properties that make it nice
for inclusion in PyTorch:
- Widely used
- Basically copies how Python does it
- Support for all the compilers and platforms we care about
- Standards track (C++20)
- Small code size
- Header only

This PR includes it as a submodule and sets up the build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37356

Differential Revision: D21262619

Pulled By: suo

fbshipit-source-id: 1d9a1a5ed08a634213748e7b02fc718ef8dac4c9
2020-04-29 09:08:24 -07:00
d37a4861b8 Explicit attribute setting for pruning and weight_norm upon reparam removal (#34170)
Summary:
To address one of the problems with RNNs that emerged in https://github.com/pytorch/pytorch/issues/33618, I modified the `remove` methods in `torch.nn.utils.prune` and `torch.nn.utils.weight_norm` to make an explicit call to `setattr`, which, in `rnn.py`, directly modifies `_flat_weights` (https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/rnn.py#L96) to include the new element.

This is important so that `_flat_weights` can reflect the presence of the `Parameter` after the (pruning or weight norm) reparametrization is removed. Without this, the weight in `_flat_weights` would remain a tensor, as originally set by the reparametrization.
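A small check of the fix (a sketch that, like the test below, relies on the current LSTM naming scheme):

```python
import torch
import torch.nn.utils.prune as prune

lstm = torch.nn.LSTM(2, 2)
prune.l1_unstructured(lstm, "weight_ih_l0", amount=0.5)
prune.remove(lstm, "weight_ih_l0")
# After removal, the restored Parameter must be reflected in _flat_weights,
# not left as the plain Tensor set by the reparametrization.
assert isinstance(lstm._flat_weights[0], torch.nn.Parameter)
```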

Simple testing is added, which depends on the current naming scheme for the LSTM module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34170

Differential Revision: D21265965

Pulled By: mickypaganini

fbshipit-source-id: 29de4a6b17052d42ccfe67c8560b7f83c20fd09d
2020-04-29 09:01:59 -07:00
6176931695 Disable stateless xnnpack for ios. (#37460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37460

It seems that the stateless XNNPACK integration for convolution is breaking iOS
runs.
The issue seems to stem from passing an invalid pointer, or a pointer that is
no longer valid; beyond that it has not been root-caused.
The issue appears only on iOS so far, but we are blanket-disabling it for
both iOS and Android. Since this improvement landed only recently, no
production models are running with it yet, so no perf
regression is expected.

Test Plan: buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform ios --framework pytorch --remote --devices D221AP-12.0.1

Reviewed By: xta0

Differential Revision: D21284385

fbshipit-source-id: 1fe01e3a476b340697972743dadf64333cc86b3f
2020-04-29 08:24:24 -07:00
cyy
9259a283b7 use detected python version to find pylibs (#34041)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34041

Differential Revision: D21302552

Pulled By: ezyang

fbshipit-source-id: 140c3d2146bad8feb425cf3670cffdbabc5101b1
2020-04-29 08:17:15 -07:00
ec8517b6df Move exponential_() to DistributionTemplates (#37456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37456

Fixes #37370

Test Plan: Imported from OSS

Differential Revision: D21290781

Pulled By: pbelevich

fbshipit-source-id: 2f516b5112b9ce1c9ba8967b3758decf86d65676
2020-04-29 08:07:35 -07:00
06168bf17d Move geometric_() to DistributionTemplates (#37418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37418

Fixes #37369

Test Plan: Imported from OSS

Differential Revision: D21290757

Pulled By: pbelevich

fbshipit-source-id: 42133f35edcbe716a07987bef2e68a4cdc27236a
2020-04-29 08:07:30 -07:00
ce6077d7a8 Move log_normal_() to DistributionTemplates (#37392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37392

Fixes #37368

Test Plan: Imported from OSS

Differential Revision: D21290740

Pulled By: pbelevich

fbshipit-source-id: 15a76b2625d2ca8187c25333a86eecd111a259c6
2020-04-29 08:06:05 -07:00
253943d5a7 Remove thrust_t from remainder_kernel_cuda (#37470)
Summary:
complex is not supported, so no need to use thrust
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37470

Differential Revision: D21296501

Pulled By: anjali411

fbshipit-source-id: bf2075ac933a793b9cdddcda0918604e7574ee2d
2020-04-29 04:01:17 -07:00
1b525f88ce Print all ops in model converter
Summary:
It will be convenient to print op names when converting the model in xplat.

This diff moves export_opnames to export_module.cpp so it can be used in xplat (caffe2:optimize_for_mobile and caffe2:torch_train). This function was in caffe2/torch/csrc/jit/serialization/export.cpp. I tried to create a target to include this file, but it involves too many ONNX deps and I could not get it to work.

Test Plan: local test, verified op names are printed

Reviewed By: iseeyuan

Differential Revision: D20961557

fbshipit-source-id: 293569081b29c263c1c441df7a63838a81560ce9
2020-04-29 02:14:59 -07:00
bf53784e3c Treat cross-execution-space-call as errors for NVCC on Windows (#37302)
Summary:
On Windows, when you call those unsupported functions like `std::pow`, `std::isnan` or `std::isinf` in the device function and compile, a warning is thrown:
```
kernel.cu
kernel.cu(39): warning: calling a __host__ function from a __host__ __device__ function is not allowed

kernel.cu(42): warning: calling a __host__ function from a __host__ __device__ function is not allowed

kernel.cu(39): warning: calling a __host__ function("isnan<double> ") from a __host__ __device__ function("test_") is not allowed

kernel.cu(42): warning: calling a __host__ function("isinf<double> ") from a __host__ __device__ function("test_") is not allowed
```
However, those calls will lead to runtime errors, see https://github.com/pytorch/pytorch/pull/36749#issuecomment-619239788 and https://github.com/pytorch/pytorch/issues/31108.  So we should treat them as errors.
Previously, the situation was even worse because these warnings were turned off entirely by passing `-w`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37302

Differential Revision: D21297207

Pulled By: ngimel

fbshipit-source-id: 822b8a98c10e54c38319674763b6681db21c1021
2020-04-29 01:52:52 -07:00
4bfa51d405 [jit] fix trace checking reporting divergent names (#37464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37464

Fixes https://github.com/pytorch/pytorch/issues/23993.

There are two fixes here:
1. Previously our name lookup function for the tracer was looking in
f.globals for names. For example:
```
sample = torch.ones(1)
traced = torch.jit.trace(my_mod, ((sample, sample,),))
# produces a graph with something like
# %sample, %sample = prim::TupleUnpack(%input)
```
This is not great if you are, e.g., trace checking, because a non-local
bit of interpreter state affects the graph produced:
```
traced = torch.jit.trace(my_mod, _clone_inputs((sample, sample,),))
# produces a graph with something like
# %0, %1 = prim::TupleUnpack(%input)
```
I have removed this functionality, as I don't think it provides huge
value. Things that look locally for names will still work, so e.g.
inputs, intermediate variables, and the like will be named correctly.

2. Previously, our input cloning for trace checking didn't do a memoized
deep copy. So:
```
_clone_inputs((sample, sample, sample))
```
produces a tuple with three non-aliased tensors. That's wrong! Use
copy.deepcopy with a memoization argument to fix this.
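A small demonstration of the memoized deep copy this relies on (plain `copy.deepcopy` semantics, not the `_clone_inputs` code itself):

```python
import copy
import torch

sample = torch.ones(1)
clones = copy.deepcopy((sample, sample, sample))
assert clones[0] is clones[1] is clones[2]  # aliasing structure preserved
assert clones[0] is not sample              # but it is a fresh copy
```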

Test Plan: Imported from OSS

Differential Revision: D21297549

Pulled By: suo

fbshipit-source-id: 981d5879a4a244520dd68489767129ff357f1497
2020-04-28 23:52:57 -07:00
a55d80e1c5 [JIT] remove dominated guards of functional values (#37105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37105

If a value isn't mutated anywhere and is guarded by a node, then we can remove all other guards that are dominated by the first guard.

This reduces the Ifs/Loops and non-tensor node counts (excluding GetAttrs and Bailouts) relative to the previous PR for the following tests, reported as (test name, Ifs/Loops, non-tensor nodes):
```
Before:  ('upsample', 0, 13)
After:  ('upsample', 0, 5)
Before:  ('upsample', 0, 2)
After:  ('upsample', 0, 1)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 12)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 12)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 12)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 12)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 7)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 7)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 7)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 17)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 17)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 17)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 1, 21)
After:  ('interpolate', 1, 18)
Before:  ('interpolate', 0, 3)
After:  ('interpolate', 0, 2)
Before:  ('interpolate', 1, 21)
After:  ('interpolate', 1, 20)
Before:  ('interpolate', 0, 3)
After:  ('interpolate', 0, 2)
Before:  ('interpolate', 1, 13)
After:  ('interpolate', 1, 11)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 1, 15)
After:  ('interpolate', 1, 13)
Before:  ('interpolate', 0, 3)
After:  ('interpolate', 0, 2)
Before:  ('interpolate', 1, 25)
After:  ('interpolate', 1, 21)
Before:  ('interpolate', 0, 1)
After:  ('interpolate', 0, 0)
Before:  ('interpolate', 1, 27)
After:  ('interpolate', 1, 23)
Before:  ('interpolate', 0, 3)
After:  ('interpolate', 0, 2)
Before:  ('test_nn_BatchNorm1d_affine', 2, 3)
After:  ('test_nn_BatchNorm1d_affine', 1, 2)
Before:  ('test_nn_BatchNorm1d_3d_input', 2, 3)
After:  ('test_nn_BatchNorm1d_3d_input', 1, 2)
Before:  ('test_nn_BatchNorm1d_affine_simple_average', 2, 5)
After:  ('test_nn_BatchNorm1d_affine_simple_average', 1, 4)
Before:  ('test_nn_BatchNorm1d_not_affine', 2, 3)
After:  ('test_nn_BatchNorm1d_not_affine', 1, 2)
Before:  ('test_nn_BatchNorm1d_3d_input_not_affine', 2, 3)
After:  ('test_nn_BatchNorm1d_3d_input_not_affine', 1, 2)
Before:  ('test_nn_BatchNorm1d_zero_batch', 2, 3)
After:  ('test_nn_BatchNorm1d_zero_batch', 1, 2)
Before:  ('test_nn_BatchNorm2d', 2, 3)
After:  ('test_nn_BatchNorm2d', 1, 2)
Before:  ('test_nn_BatchNorm2d_2d_simple_average', 2, 5)
After:  ('test_nn_BatchNorm2d_2d_simple_average', 1, 4)
Before:  ('test_nn_BatchNorm2d_momentum', 2, 3)
After:  ('test_nn_BatchNorm2d_momentum', 1, 2)
Before:  ('test_nn_BatchNorm2d_not_affine', 2, 3)
After:  ('test_nn_BatchNorm2d_not_affine', 1, 2)
Before:  ('test_nn_BatchNorm2d_zero_batch', 2, 3)
After:  ('test_nn_BatchNorm2d_zero_batch', 1, 2)
Before:  ('test_nn_BatchNorm3d', 2, 3)
After:  ('test_nn_BatchNorm3d', 1, 2)
Before:  ('test_nn_BatchNorm3d_3d_simple_average', 2, 5)
After:  ('test_nn_BatchNorm3d_3d_simple_average', 1, 4)
Before:  ('test_nn_BatchNorm3d_momentum', 2, 3)
After:  ('test_nn_BatchNorm3d_momentum', 1, 2)
Before:  ('test_nn_BatchNorm3d_not_affine', 2, 3)
After:  ('test_nn_BatchNorm3d_not_affine', 1, 2)
Before:  ('test_nn_BatchNorm3d_zero_batch', 2, 3)
After:  ('test_nn_BatchNorm3d_zero_batch', 1, 2)
Before:  ('test_nn_Transformer', 127, 467)
After:  ('test_nn_Transformer', 122, 450)
```

Test Plan: Imported from OSS

Differential Revision: D21215652

Pulled By: eellison

fbshipit-source-id: 0365fc2e351caca7e1ccaa25428908a26e3f5343
2020-04-28 23:28:18 -07:00
45e8451b33 optimize is_floating_point calls (#37012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37012

Removes an if statement in `torch.nn.functional.affine_grid`

Test Plan: Imported from OSS

Differential Revision: D21160755

Pulled By: eellison

fbshipit-source-id: 8b030936c9fbdb05b44abc9f254805d102f2acc2
2020-04-28 23:28:12 -07:00
cde1350a5d Add support for generic list constants (#36953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36953

Add support for generic lists as constants; generic dicts & tuples are already implemented. This is a pretty common pattern and cuts down on the number of non-tensor nodes executed in the interpolate tests.
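
For illustration, a hedged sketch of where such list constants show up: a scripted function with a literal `List[int]` can now carry it as a single generic list constant in the graph.

```python
import torch

@torch.jit.script
def pad2(x: torch.Tensor) -> torch.Tensor:
    # [1, 1, 2, 2] can be emitted as one list constant in the graph
    return torch.nn.functional.pad(x, [1, 1, 2, 2])

print(pad2(torch.zeros(2, 2)).shape)  # torch.Size([6, 4])
```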

Test Plan: Imported from OSS

Differential Revision: D21160761

Pulled By: eellison

fbshipit-source-id: 1e6b7b25b7580f09067794772d44e615601c60c4
2020-04-28 23:28:07 -07:00
c516f84525 [JIT] Add Lower Tuples Call & Run remove mutation after list unrolling (#36829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36829

This changes the IR complexity relative to the previous PR for the following tests:
```
('Name', 'Ifs/Loops', 'non-tensor ops')
Before:  ('max_unpool1d', 0, 3)
After:  ('max_unpool1d', 0, 0)
Before:  ('max_unpool2d', 0, 3)
After:  ('max_unpool2d', 0, 0)
Before:  ('max_unpool3d', 0, 4)
After:  ('max_unpool3d', 0, 0)
Before:  ('adaptive_max_pool2d', 0, 3)
After:  ('adaptive_max_pool2d', 0, 0)
Before:  ('adaptive_max_pool3d', 0, 4)
After:  ('adaptive_max_pool3d', 0, 0)
Before:  ('adaptive_avg_pool2d', 0, 3)
After:  ('adaptive_avg_pool2d', 0, 0)
Before:  ('adaptive_avg_pool3d', 0, 4)
After:  ('adaptive_avg_pool3d', 0, 0)
Before:  ('upsample', 13, 68)
After:  ('upsample', 4, 28)
Before:  ('upsample', 13, 68)
After:  ('upsample', 0, 5)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 13, 67)
After:  ('interpolate', 4, 27)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 13, 67)
After:  ('interpolate', 4, 27)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 13, 67)
After:  ('interpolate', 4, 27)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 13, 67)
After:  ('interpolate', 4, 27)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 14, 59)
After:  ('interpolate', 0, 3)
Before:  ('interpolate', 13, 57)
After:  ('interpolate', 4, 21)
Before:  ('interpolate', 14, 59)
After:  ('interpolate', 0, 3)
Before:  ('interpolate', 14, 59)
After:  ('interpolate', 0, 3)
Before:  ('interpolate', 13, 57)
After:  ('interpolate', 4, 21)
Before:  ('interpolate', 14, 59)
After:  ('interpolate', 0, 3)
Before:  ('interpolate', 14, 59)
After:  ('interpolate', 0, 3)
Before:  ('interpolate', 13, 57)
After:  ('interpolate', 4, 21)
Before:  ('interpolate', 14, 59)
After:  ('interpolate', 0, 3)
Before:  ('interpolate', 13, 77)
After:  ('interpolate', 4, 33)
Before:  ('interpolate', 14, 77)
After:  ('interpolate', 0, 5)
Before:  ('interpolate', 14, 77)
After:  ('interpolate', 0, 5)
Before:  ('interpolate', 13, 77)
After:  ('interpolate', 4, 33)
Before:  ('interpolate', 14, 77)
After:  ('interpolate', 0, 5)
Before:  ('interpolate', 14, 77)
After:  ('interpolate', 0, 5)
Before:  ('interpolate', 13, 77)
After:  ('interpolate', 4, 33)
Before:  ('interpolate', 14, 77)
After:  ('interpolate', 0, 5)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 14, 68)
After:  ('interpolate', 0, 4)
Before:  ('interpolate', 15, 103)
After:  ('interpolate', 1, 23)
Before:  ('interpolate', 14, 70)
After:  ('interpolate', 0, 6)
Before:  ('interpolate', 15, 103)
After:  ('interpolate', 1, 21)
Before:  ('interpolate', 14, 70)
After:  ('interpolate', 0, 6)
Before:  ('interpolate', 15, 91)
After:  ('interpolate', 1, 13)
Before:  ('interpolate', 14, 59)
After:  ('interpolate', 0, 3)
Before:  ('interpolate', 15, 93)
After:  ('interpolate', 1, 16)
Before:  ('interpolate', 14, 61)
After:  ('interpolate', 0, 5)
Before:  ('interpolate', 15, 111)
After:  ('interpolate', 1, 28)
Before:  ('interpolate', 14, 77)
After:  ('interpolate', 0, 5)
Before:  ('interpolate', 15, 113)
After:  ('interpolate', 1, 27)
Before:  ('interpolate', 14, 79)
After:  ('interpolate', 0, 7)
Before:  ('test_nn_AdaptiveMaxPool2d_single', 0, 3)
After:  ('test_nn_AdaptiveMaxPool2d_single', 0, 0)
Before:  ('test_nn_AdaptiveMaxPool2d_tuple', 0, 3)
After:  ('test_nn_AdaptiveMaxPool2d_tuple', 0, 0)
Before:  ('test_nn_AdaptiveMaxPool3d_single', 0, 4)
After:  ('test_nn_AdaptiveMaxPool3d_single', 0, 0)
Before:  ('test_nn_AdaptiveMaxPool3d_tuple', 0, 4)
After:  ('test_nn_AdaptiveMaxPool3d_tuple', 0, 0)
Before:  ('test_nn_AdaptiveMaxPool3d_single_nonatomic', 0, 4)
After:  ('test_nn_AdaptiveMaxPool3d_single_nonatomic', 0, 0)
Before:  ('test_nn_AdaptiveMaxPool3d_tuple_nonatomic', 0, 4)
After:  ('test_nn_AdaptiveMaxPool3d_tuple_nonatomic', 0, 0)
Before:  ('test_nn_AdaptiveAvgPool2d_single', 0, 3)
After:  ('test_nn_AdaptiveAvgPool2d_single', 0, 0)
Before:  ('test_nn_AdaptiveAvgPool2d_single_1x1output', 0, 3)
After:  ('test_nn_AdaptiveAvgPool2d_single_1x1output', 0, 0)
Before:  ('test_nn_AdaptiveAvgPool2d_tuple', 0, 3)
After:  ('test_nn_AdaptiveAvgPool2d_tuple', 0, 0)
Before:  ('test_nn_AdaptiveAvgPool3d_single', 0, 4)
After:  ('test_nn_AdaptiveAvgPool3d_single', 0, 0)
Before:  ('test_nn_AdaptiveAvgPool3d_tuple', 0, 4)
After:  ('test_nn_AdaptiveAvgPool3d_tuple', 0, 0)
```

Test Plan: Imported from OSS

Differential Revision: D21160758

Pulled By: eellison

fbshipit-source-id: 68ccbf3af74398e8dbad7e6bedb639635dafdb2e
2020-04-28 23:28:02 -07:00
cdc0880632 add post unroll optimizations (#36828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36828

This changes the IR complexity for the following tests:

```
("Name", "Ifs/Loops", "non-tensor ops")
Before:  ('max_unpool1d', 0, 12)
After:  ('max_unpool1d', 0, 3)
Before:  ('max_unpool2d', 0, 22)
After:  ('max_unpool2d', 0, 3)
Before:  ('max_unpool3d', 0, 33)
After:  ('max_unpool3d', 0, 4)
Before:  ('adaptive_max_pool2d', 0, 6)
After:  ('adaptive_max_pool2d', 0, 3)
Before:  ('adaptive_max_pool3d', 0, 9)
After:  ('adaptive_max_pool3d', 0, 4)
Before:  ('adaptive_avg_pool2d', 0, 6)
After:  ('adaptive_avg_pool2d', 0, 3)
Before:  ('adaptive_avg_pool3d', 0, 9)
After:  ('adaptive_avg_pool3d', 0, 4)
Before:  ('instance_norm', 1, 6)
After:  ('instance_norm', 0, 0)
Before:  ('group_norm', 1, 6)
After:  ('group_norm', 0, 0)
Before:  ('upsample', 13, 71)
After:  ('upsample', 13, 68)
Before:  ('upsample', 13, 71)
After:  ('upsample', 13, 68)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 13, 70)
After:  ('interpolate', 13, 67)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 13, 70)
After:  ('interpolate', 13, 67)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 13, 70)
After:  ('interpolate', 13, 67)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 13, 70)
After:  ('interpolate', 13, 67)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 14, 60)
After:  ('interpolate', 14, 59)
Before:  ('interpolate', 13, 58)
After:  ('interpolate', 13, 57)
Before:  ('interpolate', 14, 60)
After:  ('interpolate', 14, 59)
Before:  ('interpolate', 14, 60)
After:  ('interpolate', 14, 59)
Before:  ('interpolate', 13, 58)
After:  ('interpolate', 13, 57)
Before:  ('interpolate', 14, 60)
After:  ('interpolate', 14, 59)
Before:  ('interpolate', 14, 60)
After:  ('interpolate', 14, 59)
Before:  ('interpolate', 13, 58)
After:  ('interpolate', 13, 57)
Before:  ('interpolate', 14, 60)
After:  ('interpolate', 14, 59)
Before:  ('interpolate', 13, 82)
After:  ('interpolate', 13, 77)
Before:  ('interpolate', 14, 82)
After:  ('interpolate', 14, 77)
Before:  ('interpolate', 14, 82)
After:  ('interpolate', 14, 77)
Before:  ('interpolate', 13, 82)
After:  ('interpolate', 13, 77)
Before:  ('interpolate', 14, 82)
After:  ('interpolate', 14, 77)
Before:  ('interpolate', 14, 82)
After:  ('interpolate', 14, 77)
Before:  ('interpolate', 13, 82)
After:  ('interpolate', 13, 77)
Before:  ('interpolate', 14, 82)
After:  ('interpolate', 14, 77)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 14, 71)
After:  ('interpolate', 14, 68)
Before:  ('interpolate', 15, 106)
After:  ('interpolate', 15, 103)
Before:  ('interpolate', 14, 73)
After:  ('interpolate', 14, 70)
Before:  ('interpolate', 15, 106)
After:  ('interpolate', 15, 103)
Before:  ('interpolate', 14, 73)
After:  ('interpolate', 14, 70)
Before:  ('interpolate', 15, 92)
After:  ('interpolate', 15, 91)
Before:  ('interpolate', 14, 60)
After:  ('interpolate', 14, 59)
Before:  ('interpolate', 15, 94)
After:  ('interpolate', 15, 93)
Before:  ('interpolate', 14, 62)
After:  ('interpolate', 14, 61)
Before:  ('interpolate', 15, 116)
After:  ('interpolate', 15, 111)
Before:  ('interpolate', 14, 82)
After:  ('interpolate', 14, 77)
Before:  ('interpolate', 15, 118)
After:  ('interpolate', 15, 113)
Before:  ('interpolate', 14, 84)
After:  ('interpolate', 14, 79)
Before:  ('test_nn_BatchNorm1d_3d_input', 3, 9)
After:  ('test_nn_BatchNorm1d_3d_input', 2, 3)
Before:  ('test_nn_BatchNorm1d_3d_input_not_affine', 3, 9)
After:  ('test_nn_BatchNorm1d_3d_input_not_affine', 2, 3)
Before:  ('test_nn_BatchNorm1d_zero_batch', 3, 9)
After:  ('test_nn_BatchNorm1d_zero_batch', 2, 3)
Before:  ('test_nn_BatchNorm2d', 3, 13)
After:  ('test_nn_BatchNorm2d', 2, 3)
Before:  ('test_nn_BatchNorm2d_2d_simple_average', 3, 15)
After:  ('test_nn_BatchNorm2d_2d_simple_average', 2, 5)
Before:  ('test_nn_BatchNorm2d_momentum', 3, 13)
After:  ('test_nn_BatchNorm2d_momentum', 2, 3)
Before:  ('test_nn_BatchNorm2d_not_affine', 3, 13)
After:  ('test_nn_BatchNorm2d_not_affine', 2, 3)
Before:  ('test_nn_BatchNorm2d_not_tracking_stats', 1, 10)
After:  ('test_nn_BatchNorm2d_not_tracking_stats', 0, 0)
Before:  ('test_nn_BatchNorm2d_zero_batch', 3, 13)
After:  ('test_nn_BatchNorm2d_zero_batch', 2, 3)
Before:  ('test_nn_BatchNorm3d', 3, 17)
After:  ('test_nn_BatchNorm3d', 2, 3)
Before:  ('test_nn_BatchNorm3d_3d_simple_average', 3, 19)
After:  ('test_nn_BatchNorm3d_3d_simple_average', 2, 5)
Before:  ('test_nn_BatchNorm3d_momentum', 3, 17)
After:  ('test_nn_BatchNorm3d_momentum', 2, 3)
Before:  ('test_nn_BatchNorm3d_not_affine', 3, 17)
After:  ('test_nn_BatchNorm3d_not_affine', 2, 3)
Before:  ('test_nn_BatchNorm3d_not_tracking_stats', 1, 14)
After:  ('test_nn_BatchNorm3d_not_tracking_stats', 0, 0)
Before:  ('test_nn_BatchNorm3d_zero_batch', 3, 17)
After:  ('test_nn_BatchNorm3d_zero_batch', 2, 3)
Before:  ('test_nn_InstanceNorm1d', 1, 6)
After:  ('test_nn_InstanceNorm1d', 0, 0)
Before:  ('test_nn_InstanceNorm1d_tracking_stats', 1, 6)
After:  ('test_nn_InstanceNorm1d_tracking_stats', 0, 0)
Before:  ('test_nn_InstanceNorm2d', 1, 10)
After:  ('test_nn_InstanceNorm2d', 0, 0)
Before:  ('test_nn_InstanceNorm2d_tracking_stats', 1, 10)
After:  ('test_nn_InstanceNorm2d_tracking_stats', 0, 0)
Before:  ('test_nn_InstanceNorm3d', 1, 14)
After:  ('test_nn_InstanceNorm3d', 0, 0)
Before:  ('test_nn_InstanceNorm3d_tracking_stats', 1, 14)
After:  ('test_nn_InstanceNorm3d_tracking_stats', 0, 0)
Before:  ('test_nn_GroupNorm_1d_affine', 1, 6)
After:  ('test_nn_GroupNorm_1d_affine', 0, 0)
Before:  ('test_nn_GroupNorm_1d_no_affine_IN', 1, 6)
After:  ('test_nn_GroupNorm_1d_no_affine_IN', 0, 0)
Before:  ('test_nn_GroupNorm_1d_no_affine_LN', 1, 6)
After:  ('test_nn_GroupNorm_1d_no_affine_LN', 0, 0)
Before:  ('test_nn_GroupNorm_2d_affine', 1, 10)
After:  ('test_nn_GroupNorm_2d_affine', 0, 0)
Before:  ('test_nn_GroupNorm_2d_no_affine_IN', 1, 10)
After:  ('test_nn_GroupNorm_2d_no_affine_IN', 0, 0)
Before:  ('test_nn_GroupNorm_2d_no_affine_LN', 1, 10)
After:  ('test_nn_GroupNorm_2d_no_affine_LN', 0, 0)
Before:  ('test_nn_AdaptiveMaxPool2d_single', 0, 6)
After:  ('test_nn_AdaptiveMaxPool2d_single', 0, 3)
Before:  ('test_nn_AdaptiveMaxPool2d_tuple', 0, 6)
After:  ('test_nn_AdaptiveMaxPool2d_tuple', 0, 3)
Before:  ('test_nn_AdaptiveMaxPool3d_single', 0, 9)
After:  ('test_nn_AdaptiveMaxPool3d_single', 0, 4)
Before:  ('test_nn_AdaptiveMaxPool3d_tuple', 0, 9)
After:  ('test_nn_AdaptiveMaxPool3d_tuple', 0, 4)
Before:  ('test_nn_AdaptiveMaxPool3d_single_nonatomic', 0, 9)
After:  ('test_nn_AdaptiveMaxPool3d_single_nonatomic', 0, 4)
Before:  ('test_nn_AdaptiveMaxPool3d_tuple_nonatomic', 0, 9)
After:  ('test_nn_AdaptiveMaxPool3d_tuple_nonatomic', 0, 4)
Before:  ('test_nn_AdaptiveAvgPool2d_single', 0, 6)
After:  ('test_nn_AdaptiveAvgPool2d_single', 0, 3)
Before:  ('test_nn_AdaptiveAvgPool2d_single_1x1output', 0, 6)
After:  ('test_nn_AdaptiveAvgPool2d_single_1x1output', 0, 3)
Before:  ('test_nn_AdaptiveAvgPool2d_tuple', 0, 6)
After:  ('test_nn_AdaptiveAvgPool2d_tuple', 0, 3)
Before:  ('test_nn_AdaptiveAvgPool3d_single', 0, 9)
After:  ('test_nn_AdaptiveAvgPool3d_single', 0, 4)
Before:  ('test_nn_AdaptiveAvgPool3d_tuple', 0, 9)
After:  ('test_nn_AdaptiveAvgPool3d_tuple', 0, 4)
```

Test Plan: Imported from OSS

Differential Revision: D21160759

Pulled By: eellison

fbshipit-source-id: 91ca6ef2269ee364ca354c8d0843847744145d25
2020-04-28 23:27:57 -07:00
92129956cf Add size peephole optimization (#36758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36758

Test Plan: Imported from OSS

Differential Revision: D21160760

Pulled By: eellison

fbshipit-source-id: 9cdb8eeffa71fb4670a811347ae4fad2a82ae1d8
2020-04-28 23:27:52 -07:00
0c3a6f941f disable peephole optimizations that require alias db (#36757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36757

Replacing `x + 0` with `x` isn't much of a speedup, and the optimization is duplicated at the Tensor Expr level anyway. Constructing an alias db is costly, and it's not worth rebuilding one each time we optimize out `x + 0`.
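
For reference, a minimal sketch of the pattern in question (hypothetical function):

```python
import torch

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    # `x + 0` is the identity this peephole used to fold away; it is
    # now left for the tensor-expression level to handle instead
    return x + 0
```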

Test Plan: Imported from OSS

Differential Revision: D21160757

Pulled By: eellison

fbshipit-source-id: 9b3d4fa430b838898fe6c78660ec3c608547bb31
2020-04-28 23:26:33 -07:00
4e3dc34c47 add complex support to reciprocal_cuda kernel (#36749)
Summary:
dylanbespalko anjali411

I'm not sure whether the test should be added to `test_torch` or `test_complex`.
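
A hedged smoke test of the new kernel (requires a CUDA device; the values are illustrative):

```python
import torch

z = torch.tensor([1 + 1j, 2 - 2j], dtype=torch.complex64, device='cuda')
print(torch.reciprocal(z))  # elementwise 1/z, now supported on CUDA
```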
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36749

Differential Revision: D21290529

Pulled By: anjali411

fbshipit-source-id: 07bc282e4c9480cd015ec5db104e79728437cd90
2020-04-28 21:51:46 -07:00
fd4a09ea73 [WIP] Bind in CellParams for RNN (#35787)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35787

Test Plan: Imported from OSS

Differential Revision: D20784118

Pulled By: jamesr66a

fbshipit-source-id: 5d8f7e1502f707bff9a9aefa90e3edfb3429549b
2020-04-28 21:47:06 -07:00
74c00b1f69 move to explicit avx2 switching (#37207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37207

The main idea here is to give the build system more flexibility over when the various AVX instructions are defined; previously this was based solely on compiler-defined preprocessor flags.

Here we re-use `CPU_CAPABILITY` which already needs to be defined for each pass in `["DEFAULT", "AVX", "AVX2"]` over the source files.

To give a slightly more concrete reason this is needed: we have not found a way to override `/arch` flags previously specified on the command line from Visual Studio (which caused us to duplicate symbols in some cases).

Test Plan: CI green

Differential Revision: D21218512

fbshipit-source-id: f628153f5f3d83cd6bd4a5283fb0dc751a58ebf9
2020-04-28 21:45:34 -07:00
21b7af1e7b allow inplace leaky_relu backward calc when slope == 0 (#37453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37453

to fix (#37345)

Test Plan: Imported from OSS

Differential Revision: D21290911

Pulled By: glaringlee

fbshipit-source-id: 81677e9e195298bc1bde82b77c51f52d58aa5422
2020-04-28 21:42:33 -07:00
facdd15cc6 [quant] Finishing refactor for quantization test files (#37366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37366

- we can put both fake quant module and observer module tests in the test_workflow_module.py
- added test_quantized_functional.py
- moved tests in test_numerics.py to test_quantize.py and removed test_numerics.py

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D21282198

fbshipit-source-id: 60107cee7d1ed2cd14a45650e91ec28b8a262c52
2020-04-28 21:40:57 -07:00
e69115ec52 [quant][graph] Add JIT passes for dynamic quant multi uses of quant node (#37125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37125

For dynamic quant we need to replicate the choose_qparams and quantize functions in addition to replicating dequant.
The RemoveRedundantQuantizeOps pass checks for the choose_qparams - quant - dequant pattern in the graph and removes it if the node following it cannot be quantized using dynamic quantization.
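
For orientation, the eager-mode counterpart of the dynamic quantization these graph passes target (a hedged sketch; the model and dtype are illustrative):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
# Weights are quantized ahead of time; activation qparams are chosen at
# runtime, which is what the replicated choose_qparams/quant/dequant
# nodes express at the graph level.
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(qmodel(torch.randn(1, 8)))
```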

Test Plan:
python test_quantize_script.py test_dynamic_quant_multi_uses

Imported from OSS

Differential Revision: D21283697

fbshipit-source-id: 70fa0abdaeb2cc2935149a941d93a7e8b28d61d3
2020-04-28 21:36:55 -07:00
3b9ddab093 [quant][graph] Run dynamic quantization for specific ops (#37093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37093

Specify which ops should/can be dynamically quantized, similar to static quantization.

Test Plan:
python test_quantize_script.py test_dynamic_multi_op

Imported from OSS

Differential Revision: D21283695

fbshipit-source-id: 7ee238940c5c239f6ef8af994655e0b13db64161
2020-04-28 21:36:50 -07:00
9dab3ed5c6 [graph][quant] Enable accessing child/grandchild modules in forward (#37045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37045

Fixes to get the correct path for child modules

Test Plan: Imported from OSS

Differential Revision: D21283698

fbshipit-source-id: 48a7f7762df86a5177ea117ab0cd7cb1d6e6209d
2020-04-28 21:36:46 -07:00
e55d2e6fa6 [quant][graph] Add check for qconfig_dict key (#37014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37014

Users should only pass a (sub)module name as the key in the qconfig dict.
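
For example, a hedged sketch of a valid qconfig dict (the submodule name `'fc'` is illustrative):

```python
from torch.quantization import default_qconfig

# Keys are submodule names (strings); '' refers to the whole model.
qconfig_dict = {'fc': default_qconfig}
```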

Test Plan: Imported from OSS

Differential Revision: D21283696

fbshipit-source-id: e6babbe9302c812d6ae03ed7f843d2816b752e78
2020-04-28 21:35:17 -07:00
92b9089fd9 [jit] Fix pretty printing of functions (#37432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37432

Fixes https://github.com/pytorch/pytorch/issues/36803.

Test Plan: Imported from OSS

Differential Revision: D21284735

Pulled By: suo

fbshipit-source-id: 8c673099b3171070bff80fd1defc91487f66d4b3
2020-04-28 21:30:49 -07:00
07bb442b24 Move DistributionTemplates to anonymous namespace (#37429)
Summary:
All templates which are included from `ATen/native/cpu` must be in an anonymous namespace, especially if they use instruction set extensions but do not support dynamic dispatching.

Otherwise, the linker is free to pick the AVX2, AVX, or DEFAULT version of the instantiated templates during the final linking stage.

Test Plan: Apply on top of https://github.com/pytorch/pytorch/pull/37121 and make sure that the `basic` test successfully finishes on a CircleCI MacPro (which does not support AVX2), but `ATEN_CPU_CAPABILITY=avx2 ./basic --gtest_filter=*HalfCPU` crashes with an illegal instruction
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37429

Differential Revision: D21294818

Pulled By: malfet

fbshipit-source-id: ab32b8553de225d2f672fac2f48591682bd7dec4
2020-04-28 20:20:54 -07:00
12f5a32863 Don't use NonVariableTypeMode in custom ops (#37355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37355

Potentially fixes https://github.com/pytorch/pytorch/issues/37306
ghstack-source-id: 103073537

Test Plan: waitforsandcastle

Differential Revision: D21261946

fbshipit-source-id: 454652b528dcf942bec5438f89201822de40bbf0
2020-04-28 20:11:31 -07:00
edc5ef1afb run the simple executor for jit tests by default, add profiling jobs … (#37017)
Summary:
…for fusion tests

fix flake8 warnings

fix ci failures

fix test_determination.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37017

Differential Revision: D21238446

Pulled By: Krovatkin

fbshipit-source-id: 393e6135883dc5ac57bdff580de96c66829d454c
2020-04-28 19:16:52 -07:00
6fa76b8a0c [jit] __deepcopy__ for RecursiveScriptModule (#32684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32684

Previously we had `clone` and `clone_instance`, where `clone` clones both type
and value and `clone_instance` clones only the value; both are shallow copies.
We need to re-evaluate whether we should expose them as a user-facing API.
I think we should hide `clone`, but `clone_instance` might be useful as well, especially
when copying a model with very large weights, where people might want only a shallow copy.

This PR adds a `deepcopy` that might be useful as a user API; it deep-copies the values, including
Tensors, but does not deep-copy `Blob`, `Capsule`, `Future`, or `PyObject`.
For more discussions please see the following issue.
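
A minimal sketch of the new behavior (hedged; the module is illustrative):

```python
import copy
import torch

m = torch.jit.script(torch.nn.Linear(2, 2))
m2 = copy.deepcopy(m)  # parameters are deep-copied, not shared
assert m2.weight.data_ptr() != m.weight.data_ptr()
```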

fixes: https://github.com/pytorch/pytorch/issues/32519

Test Plan: Imported from OSS

Differential Revision: D21220756

fbshipit-source-id: 476bf11fe82c08fac36e7457879a09f545ffdc5e
2020-04-28 18:47:11 -07:00
e5a24a6389 Retry anaconda upload (#37414)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37414

Differential Revision: D21294546

Pulled By: seemethere

fbshipit-source-id: f9ee6211a0cd1b4f809ac6d3acfebfb74fbe8a2b
2020-04-28 18:24:45 -07:00
273c464145 Fix TensorIterator::view_offsets_ size (#37214)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37084

There are 3 alternatives for this design.

This PR and the first one.
When a tensor is a scalar (`ndim == 0`), accessing `view_offsets_[0]` during reductions yields an invalid offset for the index output of `argmax` and `argmin`.

fba9b9a023/aten/src/ATen/native/cpu/Reduce.h (L217)

This also happens in cuda code:
fba9b9a023/aten/src/ATen/native/cuda/Reduce.cuh (L797)

The second alternative is to check the size of `view_offsets` before accessing it, but that puts the burden of the check on every caller.

The third alternative is related to the way that inputs are treated in `argmax` and `argmin`
depending on the `dim` argument value.

fba9b9a023/aten/src/ATen/native/ReduceOps.cpp (L775-L780)

If `dim` is not specified, then the scalar gets reshaped into a 1-dim tensor and everything works properly, since now `view_offsets` has an actual entry.
If `dim` is specified, the input remains a scalar, causing the issue we see here.
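
A hedged illustration of the two paths (the scalar value is arbitrary):

```python
import torch

s = torch.tensor(3.0)           # 0-dim tensor
print(torch.argmax(s))          # no dim: input is viewed as 1-D, always worked
print(torch.argmax(s, dim=0))   # dim given: input stays 0-dim; this path hit the bad offset
```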

This PR tries to solve it in a generic way for every case, so I went with option 1. I am willing to discuss it and change the approach if you think one of the other alternatives is better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37214

Differential Revision: D21258320

Pulled By: ngimel

fbshipit-source-id: 46223412187bbba4bfa7337e3f1d2518db72dea2
2020-04-28 18:08:51 -07:00
dcd8a1b399 Revert D21286660: [quant] Generalizing _calculate_dynamic_qparams in quantized test
Test Plan: revert-hammer

Differential Revision:
D21286660

Original commit changeset: 98d90cdb34ac

fbshipit-source-id: a4194193c9aa53fb2dc9bbc04fde9c2925aa378f
2020-04-28 18:01:44 -07:00
6c0f447b51 Remove ONNX BatchNorm(12) test and converter. (#37309)
Summary:
Pursuant to https://github.com/onnx/onnx/pull/2750 we must remove PyTorch ONNX exporter related changes to BatchNorm(12) that were introduced as part of https://github.com/pytorch/pytorch/pull/35567. This change is also needed to unblock ONNX [BUILD CI failures](https://circleci.com/gh/onnx/onnx/4629?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link) caused by PyTorch/Caffe2 tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37309

Reviewed By: hl475

Differential Revision: D21288914

Pulled By: houseroad

fbshipit-source-id: 15b076a2af55918dcd57f4e2fc77accd3d1510bd
2020-04-28 17:45:01 -07:00
8258d42bd0 [pytorch] add '__BASE__' section to op deps to factor out frequently used util ops (#37404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37404

Many aten operators are really util functions, e.g.
aten::is_nonzero, aten::is_floating_point, etc. These ops can be called
via overloaded C++ operators, so seemingly trivial and innocent code changes can
affect how these ops are used by other ops (thus changing the output of
the static analyzer).

Most of these util ops are rather small in terms of build size cost, so
for the purpose of optimizing binary size with a custom build, whether to
include these ops or not does not make a significant difference. In fact,
for non-trivial models a set of these ops is almost always used.

This PR introduced the (optional) '__BASE__' ops section to the dependency graph.

We can maintain the list of frequently used small util ops for the internal BUCK
build. This way, the output dependency graph will only contain meaningful
edges with significant binary size impact, and it will be more stable against
trivial code changes (which is checked in the FB codebase).

Having a stable and sparse deps graph by factoring out frequently used base ops
is also a nice property that lets us explore alternative custom build
solutions in case we find it hard to maintain the static code analyzer.

Test Plan: Imported from OSS

Differential Revision: D21280835

Pulled By: ljk53

fbshipit-source-id: c4d0d1f07ca868c60f23118d877fc1eeead4c875
2020-04-28 17:18:09 -07:00
e0a5b443d6 [pytorch] remove unused flags from code analyzer & move format support to python (#37393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37393

Simplify the code analyzer by removing some unused flags and moving the
different format-printer logic into the Python script. This makes it easier to add
other post-processing logic to adapt to different BUCK build configs.

Test Plan: Imported from OSS

Differential Revision: D21280836

Pulled By: ljk53

fbshipit-source-id: 0d66d5891d850f012c4ab4f39eabbd9aecc1caa9
2020-04-28 17:16:55 -07:00
239ce75a74 [quant] Generalizing _calculate_dynamic_qparams in quantized test (#37451)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37451

Test Plan: Imported from OSS

Differential Revision: D21286660

Pulled By: z-a-f

fbshipit-source-id: 98d90cdb34ac3d0ef33f7ebe1c9f32001d4e80b6
2020-04-28 16:55:51 -07:00
024f663fc1 Resubmit "Fix NaN error in dynamic quantization in qLinear, re-enable test_quantized_rnn"" (#37458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37458

Original commit changeset: 0204c360ef2c
ghstack-source-id: 103069356

Test Plan: buck test caffe2/test:quantization -- 'test_quantized_rnn \(quantization\.test_quantization\.PostTrainingDynamicQuantTest\)' --print-passing-details --run-disabled

Reviewed By: jamesr66a

Differential Revision: D21287904

fbshipit-source-id: a60deb8bdcf6af4258d1c1b199fa2a5a90318528
2020-04-28 16:18:15 -07:00
5b6f6da18c [caffe2] Copy tensor in single tensor input case in UnPackRecordsOp (#37454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37454

Fix a bug introduced in D21224497.

In the case of having a single unpacked tensor as input, we still need to copy the underlying memory because only inputs are guaranteed to be read-only. The output could be overwritten later during inference. If we shared the tensor, we could potentially overwrite the input, which in principle should be read-only.

Test Plan:
```
buck test caffe2/caffe2/python/operator_test:dataset_ops_test
```

AdIndexer canary:
https://our.intern.facebook.com/intern/ads/canary/426290361213982683

Reviewed By: yinghai

Differential Revision: D21274309

fbshipit-source-id: 71931d4b1afbdc700ba070ea618d1679f1bbe5a7
2020-04-28 15:37:11 -07:00
d1a39815f9 Remove Python 2 string compatibility in ATen/function_wrapper.py (#37388)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37388

Differential Revision: D21285638

Pulled By: ezyang

fbshipit-source-id: 0cd524a5c000581d6fb3d1dd299191a4cbf19766
2020-04-28 14:23:50 -07:00
580928801f [ONNX] Adding 'numel' and 'to' export for script module (#36501)
Summary:
These two ops are needed for torchvision model export. Since we're scripting a part of the code for dynamic export of models (in https://github.com/pytorch/vision/pull/2052), these two changes are required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36501

Reviewed By: hl475

Differential Revision: D21260721

Pulled By: houseroad

fbshipit-source-id: 86d9d38665a4a36d22cec741012d976e5bd8d36b
2020-04-28 12:35:45 -07:00
a51f047c7e Synchronize MAGMA functions with the current CUDA stream (#36605)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21821

This follows ngimel's [suggestion](https://github.com/pytorch/pytorch/issues/21821#issuecomment-502968982) to manually synchronize MAGMA calls with the current stream. This is handled automatically with `MagmaStreamSyncGuard`.

I think for the functions with `_batched` variants we could possibly avoid synchronization by using a batch of size 1, since these have a `magma_queue_t` argument. However, I presume there's a reason it wasn't written like that in the first place.

I also figured out why porting to aten ["magically fixed"](https://github.com/pytorch/pytorch/issues/21821#issuecomment-527647971) `torch.svd`. The magma functions for svd all take host arrays as input and output. The ATen port uses blocking `copy_`s which fully synchronize the operation. On the other hand, the THC functions use `cudaMemcpy` which doesn't synchronize with streams created with `cudaStreamNonBlocking` (which `aten` does). The fix is to use `cudaMemcpyAsync` and `cudaStreamSynchronize`, the same as `copy_` does internally:
835ee34e38/aten/src/ATen/native/cuda/Copy.cu (L192-L193)

I'm not sure how to test these changes, as I wasn't able to reproduce any of the stream sync issues. Possibly a mixture of non-determinism and the fact that some of these functions are implicitly synchronous anyway.
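
A hedged sketch of the stream interaction this guards against (requires CUDA built with MAGMA; the shapes are arbitrary):

```python
import torch

side = torch.cuda.Stream()
with torch.cuda.stream(side):
    a = torch.randn(64, 64, device='cuda')
    # The MAGMA-backed solver now synchronizes with this (non-default) stream.
    u, s, v = torch.svd(a)
torch.cuda.synchronize()
print(s[:3])
```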
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36605

Differential Revision: D21258265

Pulled By: ngimel

fbshipit-source-id: 76d8f687c605e5e9cd68b97dc1d70a39a13376ec
2020-04-28 11:24:44 -07:00
d068a456d3 [resubmit] Enable global observers API (#37382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37382

After adding c10::DispatchKey::Profiler, the behavior of RecordFunction
observers is also controlled by the dispatch key;
this PR moves the logic out of the profiler and into the record function

Reviewed By: jamesr66a

Differential Revision: D21268320

fbshipit-source-id: 93207e3b55325d20dcc5b1e8f448ab86933321da
2020-04-28 10:49:31 -07:00
4234d62489 [hotfix] Workaround for older versions of ninja (#37417)
Summary:
Older versions of ninja don't like relative paths in configure_file when it is called twice.

https://gitlab.kitware.com/cmake/cmake/issues/17601

Fix suggested in comments https://gitlab.kitware.com/cmake/cmake/-/issues/18584
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37417

Reviewed By: malfet

Differential Revision: D21280141

Pulled By: bwasti

fbshipit-source-id: 4cb94996a9e8ae8c01602ea1da6f4ce9d61fa700
2020-04-28 09:03:51 -07:00
c5d6f59ab1 Replacing EHa with EHsc (#37235)
Summary:
We should not rely on asynchronous (SEH) exceptions. Catching only C++ exceptions is more sensible and yields a gain in both size (1163 MB -> 1073 MB, 0.92x) and build time (51m -> 49m, 0.96x).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37235

Differential Revision: D21256918

Pulled By: ezyang

fbshipit-source-id: 572ee96f2e4c48ad13f83409e4e113483b3a457a
2020-04-28 08:20:37 -07:00
8fe2a5e91b Fixes type annotations for named tensors #27846 (#36890)
Summary:
This enables type checking for named tensors, and fixes the underlying problems.

The bulk of the fix is modifying `gen_pyi.py` to generate reasonable types in `torch/__init__.pyi`.  I took two approaches:  First, I tried to take a generic approach and added `DimnameList` to the magic list of variable argument lists.  Unfortunately that was insufficient for many of the method signatures, so I also added manual definitions for `rename`, `refine_names`, and `unflatten` in `__init__.pyi.in`.

Finally, there were a few problems in the doctests that had to be cleaned up so that `test/test_type_hints.py` would run successfully.
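
For reference, a hedged sketch of the named-tensor methods whose stubs were hand-written (the dimension names are illustrative):

```python
import torch

t = torch.zeros(2, 3, names=('N', 'C'))
t2 = t.rename(N='batch')                       # rename a dimension
t3 = torch.zeros(2, 3).refine_names('N', 'C')  # attach names to unnamed dims
t4 = t.unflatten('C', (('C1', 1), ('C2', 3)))  # split a named dim
```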
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36890

Differential Revision: D21259192

Pulled By: zou3519

fbshipit-source-id: 2a9e7d7bec9be5ae3ae2995078c6abfa3eca103c
2020-04-28 06:51:22 -07:00
ebcacd5e87 [Bazel] Build ATen_CPU_AVX2 lib with AVX2 arch flags enabled (#37381)
Summary:
Make sleef dependency public so that `ATen_CPU_{capability}` libs can depend on it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37381

Test Plan: CI

Differential Revision: D21273443

Pulled By: malfet

fbshipit-source-id: 7f756c7f3c605e51cf0c27ea37f687913cd48708
2020-04-27 22:49:37 -07:00
b37080d97a remove record_function_enter and record_function_exit from header (#37052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37052

These only need to be in the cpp as they are not referenced anywhere
else. These functions should only be used from the Python operators
torch.ops.profiler.record_function_{enter, exit}.
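
For context, a hedged sketch of the public wrapper that drives these ops:

```python
import torch

with torch.autograd.profiler.profile() as prof:
    # record_function invokes the enter/exit ops under the hood
    with torch.autograd.profiler.record_function('my_scope'):
        y = torch.relu(torch.randn(4))
print(prof.key_averages().table())
```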
ghstack-source-id: 102979051

Test Plan: CI

Differential Revision: D21171987

fbshipit-source-id: dfe8130d2b64de6179222327069ce1ab877829e3
2020-04-27 21:21:34 -07:00
48b126f496 [caffe2] Fast path for single tensor in UnPackRecordsOp (#37361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37361

Add a fast path for the case of batch_size = 1 and single ad embedding in UnPackRecordsOp. In this case, there is no need to pack the single tensor into a shared_ptr<vector<vector<Tensor>>> and then unpack it in UnPackRecordsOp. Instead, we can just pass the tensor as it is into UnPackRecordsOp and share the data with the output tensor.

Reviewed By: yinghai

Differential Revision: D21224497

fbshipit-source-id: 70685e5cc20ffdc5e0044a4b97a7fc5133786db4
2020-04-27 20:50:21 -07:00
da64ed14f6 Reduce volume of spammy warning (#37360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37360

title

Test Plan: CI

Reviewed By: jamesr66a

Differential Revision: D21263320

fbshipit-source-id: 590da265ca1f7d4e19c28c47ffbad9b6af6fdc2f
2020-04-27 19:45:51 -07:00
4ff4119d45 [rpc] Move _set_rpc_backend and RpcBackendOptions to use float instead of timedelta (#37027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027

The RPC timeout passed into rpc_sync and rpc_async is now a float after the
change below, so we should make these APIs consistent.
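
A hedged sketch of the consistent float-seconds form (the constructor arguments here are illustrative):

```python
from torch.distributed.rpc import ProcessGroupRpcBackendOptions

# rpc_timeout is a plain float in seconds, matching rpc_sync/rpc_async
opts = ProcessGroupRpcBackendOptions(
    rpc_timeout=30.0,
    init_method='tcp://localhost:29500',
)
```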
ghstack-source-id: 102971906

Test Plan:
Existing unittests, also added unittest testing specific timeout set
in ProcessGroupRpcBackendOptions and the dispatch rpc backend options handling.

Differential Revision: D21125171

fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
2020-04-27 19:38:06 -07:00
5a59bbc1da [TensorExpr] IRPrinter: show output_args separate from reduce_args when printing ReduceOp. (#37367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37367

Before this change we printed all the args in the same list, for example:
```
BEFORE RFACTOR:
{
  for (int m = 0; m < m_1; m++) {
    for (int n = 0; n < n_1; n++) {
      sum[0] = ReduceOp(sum, float(0), (sum[0]) + (b[m, n]), {m, n});
    }
  }
}
AFTER RFACTOR:
{
  for (int m = 0; m < m_1; m++) {
    for (int n = 0; n < n_1; n++) {
      tmp_buf[n] = ReduceOp(tmp_buf, float(0), (tmp_buf[n]) + (b[m, n]), {n, m}); # <<< n is out, m is reduce here
    }
  }
  for (int n = 0; n < n_1; n++) {
    sum[0] = ReduceOp(sum, float(0), (sum[0]) + (tmp_buf[n]), {n});
  }
}
```

With this change we explicitly show which args are reduce args:
```
BEFORE RFACTOR:
{
  for (int m = 0; m < m_1; m++) {
    for (int n = 0; n < n_1; n++) {
      sum[0] = ReduceOp(sum, float(0), (sum[0]) + (b[m, n]), out_args={}, reduce_args={m, n});
    }
  }
}
AFTER RFACTOR:
{
  for (int m = 0; m < m_1; m++) {
    for (int n = 0; n < n_1; n++) {
      tmp_buf[n] = ReduceOp(tmp_buf, float(0), (tmp_buf[n]) + (b[m, n]), out_args={n}, reduce_args={m});
    }
  }
  for (int n = 0; n < n_1; n++) {
    sum[0] = ReduceOp(sum, float(0), (sum[0]) + (tmp_buf[n]), out_args={}, reduce_args={n});
  }
}
```

Test Plan: Imported from OSS

Differential Revision: D21265807

Pulled By: ZolotukhinM

fbshipit-source-id: 384396cd55562570f8e33657b856a4404d451080
2020-04-27 18:49:29 -07:00
a4383266f0 Revert D21262421: [pytorch][PR] [doc] Fix JIT code highlighting
Test Plan: revert-hammer

Differential Revision:
D21262421

Original commit changeset: 4fb62cce9543

fbshipit-source-id: 4e852e178a2469d94ddbf8ee18903ed8cebd4906
2020-04-27 18:30:18 -07:00
f1e89fbe53 [pytorch] add missing host-device attribute to fix clang build (#37358)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37358

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //vision/fair/detectron2/tools:benchmark
```

Reviewed By: ngimel

Differential Revision: D21262235

fbshipit-source-id: 00633352d87da0881b2cc90759265fa0d0bd96be
2020-04-27 18:24:20 -07:00
fae87908d9 Back out "Fix NaN error in dynamic quantization in qLinear, re-enable test_quantized_rnn"
Summary:
Original commit changeset: 948a888f5516

(Note: this ignores all push blocking failures!)

Test Plan: Revert

Reviewed By: jamesr66a

Differential Revision: D21269147

fbshipit-source-id: 0204c360ef2c3f28c2b2fbe367eb4cfd77f717c4
2020-04-27 18:01:44 -07:00
cf41f6bed1 Fix record_function (#37364)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37364

Test Plan: Imported from OSS

Reviewed By: ilia-cher

Differential Revision: D21265347

Pulled By: jamesr66a

fbshipit-source-id: f2fe2f6879ea220501fc518977cdb6a6d3529f87
2020-04-27 17:56:10 -07:00
ed0a572eed Migrate scatter and scatter_ from the TH to Aten (CUDA) (#35697)
Summary:
Fixes [#24621](https://github.com/pytorch/pytorch/issues/24621).

Some preliminary results:
## Case 1: dense indexing
```python
import torch
import numpy
from IPython import get_ipython

numpy.random.seed(13)
torch.manual_seed(13)

ipython = get_ipython()

Ms=1024 * 8
Ns=1024 * 4
dim = 0
top_power = 4

for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_ = torch.rand(M, N, device=torch.device('cuda'))
        src = torch.rand(M, N, device=torch.device('cuda'))
        index = torch.tensor(numpy.random.randint(0, min(M, N), (M, N)), device=torch.device('cuda') )
        print(f"Problem size (MxN): {M}x{N}")
        ipython.magic("timeit input_.scatter_(0, index, src); torch.cuda.synchronize()")
        ipython.magic("timeit input_.scatter_(1, index, src); torch.cuda.synchronize()")

```
### TH
```
Problem size (MxN): 8192x4096
11.5 ms ± 4.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.21 ms ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem size (MxN): 8192x8192
24.1 ms ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.49 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 8192x16384
48.5 ms ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.3 ms ± 23 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 8192x32768
97.5 ms ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
12.2 ms ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x4096
22.9 ms ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.43 ms ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x8192
48.2 ms ± 3.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.03 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x16384
97.6 ms ± 5.54 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
10.2 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x32768
196 ms ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.2 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Problem size (MxN): 32768x4096
45.8 ms ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.85 ms ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 32768x8192
96.4 ms ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
10 ms ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 32768x16384
195 ms ± 7.16 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.3 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Problem size (MxN): 32768x32768
391 ms ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
40.7 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Problem size (MxN): 65536x4096
91.5 ms ± 5.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
9.65 ms ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 65536x8192
192 ms ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.1 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 65536x16384
390 ms ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
40.7 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Problem size (MxN): 65536x32768
783 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
86.9 ms ± 76.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

```
### ATen
```
Problem size (MxN): 8192x4096                                                                                                            [49/1095]
12 ms ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.19 ms ± 236 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem size (MxN): 8192x8192
25.1 ms ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 ms ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 8192x16384
50.6 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.62 ms ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 8192x32768
102 ms ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
9.26 ms ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x4096
23.9 ms ± 3.01 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.37 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x8192
50.2 ms ± 3.08 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.83 ms ± 8.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x16384
102 ms ± 3.97 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
9.79 ms ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 16384x32768
204 ms ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.1 ms ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 32768x4096
47.8 ms ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.72 ms ± 8.92 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 32768x8192
100 ms ± 4.53 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
9.68 ms ± 7.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 32768x16384
203 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.6 ms ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 32768x32768
408 ms ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
39.4 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Problem size (MxN): 65536x4096
95.6 ms ± 3.77 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
9.41 ms ± 735 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 65536x8192
201 ms ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.3 ms ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem size (MxN): 65536x16384
407 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
39.2 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Problem size (MxN): 65536x32768
816 ms ± 40.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
78.4 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

```

## Case 2: sparse indexing
```python
import torch
from IPython import get_ipython

ipython = get_ipython()

torch.set_num_threads(1)

device = 'cuda'

nrows = 100000
ncols = 100000
dims = [nrows, ncols]

res = torch.randn(dims, device=device)
idx1 = torch.randint(dims[0], (1, dims[1]), device=device).long()
src1 = torch.randn(1, dims[1], device=device)
idx2 = torch.randint(dims[1], (dims[0], 1), device=device).long()
src2 = torch.randn(dims[0], 1, device=device)

ipython.magic("timeit res.scatter_(0, idx1, src1); torch.cuda.synchronize()")
ipython.magic("timeit res.scatter_(1, idx2, src2); torch.cuda.synchronize()")

```
### TH
```
199 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
43.3 µs ± 95.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
### ATen
```
199 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
119 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

## Case 3: many-to-one, one-to-many
```python
import torch
from IPython import get_ipython

ipython = get_ipython()

torch.set_num_threads(1)

device = 'cuda'

nfeat = 10000
nrep = 5
a=torch.arange(nfeat, device=device).repeat_interleave(nrep)
batch=3 #batch can vary 1-200

res = torch.randn(100000, 100000, device=device)

for batch in [100, 500, 1000, 5000, 10000]:
    print("Batch: ", batch)
    c=torch.randint(3, (batch, nfeat * nrep), device=device).float()
    ipython.magic("timeit res.scatter_(1,a.unsqueeze(0).expand(batch,a.size(0)),c); torch.cuda.synchronize()")

    enum_values = [
        list(range(1, 201)),
        list(range(1000, 1020)),
        list(range(2000, 2010)),
        list(range(3000, 3206)),
    ]
    indices = torch.tensor([i for i, values in enumerate(enum_values) for _j in range(len(values))], device=device)
    c = torch.randint(3, (batch, 4), device=device).float()
    idx = indices.unsqueeze(0).expand(c.size(0), indices.size(0))
    src = c.repeat(1, idx.shape[-1] // c.shape[-1])
    ipython.magic("timeit res.scatter_(1,idx,src); torch.cuda.synchronize()")
    print()

```
### TH
```
Batch:  100
119 µs ± 287 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.7 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Batch:  500
534 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
16.4 µs ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Batch:  1000
1.06 ms ± 2.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
20.6 µs ± 53.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Batch:  5000
5.28 ms ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
56.3 µs ± 93.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Batch:  10000
10.6 ms ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
101 µs ± 148 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

```
### ATen
```
Batch:  100
63.9 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
13.5 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Batch:  500
241 µs ± 535 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
14.8 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Batch:  1000
468 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
16.7 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Batch:  5000
2.27 ms ± 5.59 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
31.1 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Batch:  10000
4.52 ms ± 5.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
54 µs ± 82.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

```

## Correctness (passed)
```python
import torch
import numpy
from IPython import get_ipython

numpy.random.seed(13)
torch.manual_seed(13)

ipython = get_ipython()

Ms=1024 * 2
Ns=1024 * 2
dim = 0
top_power = 5

for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_ = torch.rand(M, N, device=torch.device('cuda'))
        input_clone_ = input_.clone()
        #src = torch.rand(M, N, device=torch.device('cuda'))
        src = torch.ones(M, N, device=torch.device('cuda'))
        index = torch.tensor(numpy.random.randint(0, min(M, N), (M, N)), device=torch.device('cuda') )
        other_index1 = torch.arange(0, N, device=torch.device('cuda')).repeat(M, 1)
        other_index0 = torch.arange(0, M, device=torch.device('cuda')).repeat(N, 1).t()
        print(f"Problem size (MxN): {M}x{N}")
        #ipython.magic("timeit input_.scatter_(0, index, src); torch.cuda.synchronize()")
        #ipython.magic("timeit input_.scatter_(1, index, src); torch.cuda.synchronize()")

        input_.scatter_(0, index, src)
        input_clone_.index_put_((index, other_index1), src);
        assert((input_ == input_clone_).all())

        input_.scatter_(1, index, src)
        input_clone_.index_put_((other_index0, index), src);
        assert((input_ == input_clone_).all())

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35697

Differential Revision: D21258380

Pulled By: ngimel

fbshipit-source-id: aebf01474cc9caf0a1dc1041ca6b753e3981df2e
2020-04-27 17:32:03 -07:00
b8ec165c0d Fix failing test in test_torch.py (#37362)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37362

Differential Revision: D21264829

Pulled By: anjali411

fbshipit-source-id: cec6af84630378f03cb3863c85e161776af236cd
2020-04-27 16:42:11 -07:00
20143e5f27 Revert D21245094: [resubmit] Enable global observers API
Test Plan: revert-hammer

Differential Revision:
D21245094

Original commit changeset: 595e41b18206

fbshipit-source-id: 90344b361857d76ce5db75438c949dad1f5f186b
2020-04-27 16:19:46 -07:00
d294c06287 Fetch TORCH_PYTHON_SRCS filelists from build_variables (#37267)
Summary:
In `build_variables.bzl`, split the filelist into `libtorch_python_core_sources` and `libtorch_python_distributed_sources`.
Move the JIT passes from `glob_libtorch_python_sources()` to the `libtorch_core_jit_sources` filelist.

Validated that the original `TORCH_PYTHON_SRCS` filelist matches the one in `build_variables.bzl` by running the following script:
```
import os

def read_file(path):
    with open(path) as f:
        return f.read()

def get_cmake_torch_python_srcs():
    caffe2_cmake = read_file("torch/CMakeLists.txt")
    start = caffe2_cmake.find("set(TORCH_PYTHON_SRCS")
    end = caffe2_cmake.find(")", start)
    return caffe2_cmake[start:end+1]

def get_cmake_torch_python_srcs_list():
    _srcs = get_cmake_torch_python_srcs()
    unfiltered_list = [x.strip() for x in _srcs.split("\n") if len(x.strip())>0]
    return [x.replace("${TORCH_SRC_DIR}/","torch/") for x in unfiltered_list if 'TORCH_SRC_DIR' in x]

import imp
build_variables = imp.load_source('build_variables', 'tools/build_variables.bzl')
libtorch_python_sources = set(build_variables.libtorch_python_core_sources)
torch_python_srcs = set(get_cmake_torch_python_srcs_list())

print(set.difference(libtorch_python_sources, torch_python_srcs))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37267

Test Plan: CI

Differential Revision: D21258292

Pulled By: malfet

fbshipit-source-id: bb6d7ee73c97cbe149a9021756b9a4c9fb3ce50e
2020-04-27 16:10:07 -07:00
1039b95ff0 [autograd] add documentation about multithread autograd (#37020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37020

Add multithread autograd documentation to the doc note.

Test Plan: Imported from OSS

Differential Revision: D21260996

Pulled By: wanchaol

fbshipit-source-id: 91d523560268ae62d4c6d773121b282ba837a561
2020-04-27 15:53:21 -07:00
b3ada29584 Skip test_profiler_custom_op on ROCm (#37374)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37374

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D21266446

Pulled By: jamesr66a

fbshipit-source-id: 405a14e92ae7222cc9163fc392c34066f75862f6
2020-04-27 15:47:41 -07:00
16f4501cd4 Improve checkpoint docs to warn users about detached gradient issues (#37266)
Summary:
See https://discuss.pytorch.org/t/training-with-gradient-checkpoints-torch-utils-checkpoint-appears-to-reduce-performance-of-model/78102/3?u=jwl for details.

Updated the docs to warn users about issues with checkpointing models that use `detach()` or `torch.no_grad()` to freeze their model layers/weights during training. When they do this, training with `checkpoint` will fail as it forces the outputs to require gradients when the model itself does not. Hence, during the backward pass it will output the error:
```
[4]<stderr>:RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

Maybe it is possible to fix this directly in the code, but I am not sure how in the current codebase.
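
For illustration, a minimal sketch (not part of this PR) that reproduces the failure mode; `frozen_block` is a hypothetical stand-in for a frozen section of a model:
```python
import torch
from torch.utils.checkpoint import checkpoint

lin = torch.nn.Linear(4, 4)

def frozen_block(x):
    with torch.no_grad():  # simulates a frozen layer, as described above
        return lin(x)

x = torch.randn(2, 4, requires_grad=True)
# On newer PyTorch versions, pass use_reentrant=True to keep the behavior described here.
out = checkpoint(frozen_block, x)
out.sum().backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```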
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37266

Differential Revision: D21262558

Pulled By: mrshenli

fbshipit-source-id: 529cf370534504baf8937ef17dac5d6916fbf5ae
2020-04-27 15:25:23 -07:00
023c3575f0 [doc] Fix JIT code highlighting (#37338)
Summary:
Fix https://github.com/pytorch/pytorch/issues/36216

| Before | After |
| --- | --- |
| ![image](https://user-images.githubusercontent.com/6421097/80353700-55abec80-883b-11ea-9ae2-72f37ba23c16.png)| ![image](https://user-images.githubusercontent.com/6421097/80353403-ef26ce80-883a-11ea-885b-2a2963f79d20.png) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37338

Differential Revision: D21262421

Pulled By: mrshenli

fbshipit-source-id: 4fb62cce9543e6a4852828f58a279c36565f8c44
2020-04-27 15:04:42 -07:00
8dc5502cb1 Do not add special CUDNN search path rules for torch_python (#37349)
Summary:
Those rules never worked until https://github.com/pytorch/pytorch/pull/37275, and afterwards they caused crashes in manywheels builds, because linking `cudnn` into `libtorch_python` and `libtorch_cuda` causes double-free exceptions, see: https://app.circleci.com/pipelines/github/pytorch/pytorch/160350/workflows/85696e1c-1e67-4780-8ceb-18bc0a614507/jobs/5254443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37349

Test Plan: Temporarily enable the `manywheels` build on this PR and validate that it fixes the issue, see https://app.circleci.com/pipelines/github/pytorch/pytorch/160796/workflows/13227fbc-97c0-47f6-9a87-e840e1a4b5de/jobs/5267315/steps

Differential Revision: D21264484

Pulled By: malfet

fbshipit-source-id: 01f1082cdb0c078cb0fdd7da381c037df7c89b6f
2020-04-27 14:58:48 -07:00
f463586739 Revert D20984966: [quant] Generalizing _calculate_dynamic_qparams in quantized test
Test Plan: revert-hammer

Differential Revision:
D20984966

Original commit changeset: 17437297adae

fbshipit-source-id: 30b9f7a2b2a772b2bf1c4b81cf99bddf37d4b179
2020-04-27 14:36:44 -07:00
f07b85b6a6 Revert D20984967: [quant] quantized reflection_pad1d
Test Plan: revert-hammer

Differential Revision:
D20984967

Original commit changeset: 4731f16ba05a

fbshipit-source-id: ad3b4edaeb837c9561c36c36a122a6f9c00cd0db
2020-04-27 14:35:28 -07:00
5fab4c30dd [resubmit] Enable global observers API (#37292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37292

After adding c10::DispatchKey::Profiler the behavior of RecordFunction
observers is also controlled by the dispatch key,
this PR moves the logic outside of the profiler into the record function

Reviewed By: jamesr66a

Differential Revision: D21245094

fbshipit-source-id: 595e41b18206d2ba4cf639cb320f630907868b3f
2020-04-27 14:24:51 -07:00
e33c3e49d5 Fix hard-code cmake target (#37310)
Summary:
Fix https://github.com/pytorch/pytorch/issues/33928. Basically just move the dependency into a new imported target.

I'm not sure whether this modification will affect other parts, so please test it thoroughly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37310

Differential Revision: D21263066

Pulled By: ezyang

fbshipit-source-id: 7dc38f578d7e9bcb491ef5e122106fb66a33156f
2020-04-27 14:20:30 -07:00
c4401ea9ab Make test_quantize runnable (#37357)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37357

Test Plan: Imported from OSS

Differential Revision: D21262896

Pulled By: jamesr66a

fbshipit-source-id: b39d5f751678a6f2f8c40b65fd2cdb96c58f7eaf
2020-04-27 13:46:52 -07:00
e8421807d8 [TensorExpr] Fix indentation in CudaPrinter. (#37305)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37305

Test Plan: Imported from OSS

Differential Revision: D21245880

Pulled By: ZolotukhinM

fbshipit-source-id: aab0492130682cb603f4f51d69ccb56e9582dc60
2020-04-27 13:38:57 -07:00
e49ccdf211 [TensorExpr] Add IRPrinter::visit for AtomicAdd. (#37304)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37304

Test Plan: Imported from OSS

Differential Revision: D21245881

Pulled By: ZolotukhinM

fbshipit-source-id: f2062c86227b1b5a9a4c0a3187da024a4db56d12
2020-04-27 13:36:52 -07:00
d167a7f654 Revert D21256854: [pytorch][PR] Add overloads of std:: math functions for c10::complex
Test Plan: revert-hammer

Differential Revision:
D21256854

Original commit changeset: 2112ba6b7992

fbshipit-source-id: b81c377f9cd33a493a63d1e666cbe6765516fca8
2020-04-27 13:23:34 -07:00
af9c3a3652 uniform_int_distribution does not support uint8_t (#37260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37260

List of supported types here:
https://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution

Test Plan: CircleCI green, test compiles and passes on msvc.

Reviewed By: malfet

Differential Revision: D21237280

fbshipit-source-id: 51b09b87511e35bfe8a57ecd48ed772d587dba9b
2020-04-27 13:09:39 -07:00
045c588bc6 Enable use_c10_dispatcher: full for some more ops (#37273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37273

The issues that prevented these ops from being `use_c10_dispatcher: full` have either been fixed, or the ops were newly introduced without the tag even though they could have used it.
Let's enable the tag for them.
ghstack-source-id: 102896116

Test Plan: waitforsandcastle

Differential Revision: D21242516

fbshipit-source-id: 5158ecc1ff6b34896f36904ea7bd7fcb4811a0bf
2020-04-27 13:05:42 -07:00
201ba13911 Correct $ANDROID_HOME string empty check (#37064)
Summary:
Updated file to correct shell code to test whether $ANDROID_HOME env variable is empty or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37064

Differential Revision: D21181787

Pulled By: IvanKobzarev

fbshipit-source-id: 40c1d79d0fb730c7f68aa7472ce9b2398e91f2a2
2020-04-27 11:16:51 -07:00
805c417ec9 Implement avg_pool2d kernel for channels_last (#35855)
Summary:
Implement avg_pool2d for channels_last. This will close https://github.com/pytorch/pytorch/issues/34996.

Performance compared with **avg_pool2d** contiguous can be found at ed6617c6bc/avg-pool2d-channels-last/avg-pool2d-naive.ipynb
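
For reference, a minimal usage sketch (assuming a build with this change); the expectation is that the channels_last memory format is preserved through the pooling op:
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)
y = F.avg_pool2d(x, kernel_size=2)
# With this change, the output should stay in channels_last format.
print(y.is_contiguous(memory_format=torch.channels_last))  # expected: True
```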

cc csarofeen ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35855

Differential Revision: D21187360

Pulled By: VitalyFedyunin

fbshipit-source-id: b654b56168bc3982be306b634c7ed2f92018a9e5
2020-04-27 11:06:10 -07:00
ec8006cc16 [ONNX] fix provider_version and add consistency test (#36797)
Summary:
forward port the test from pr gh-36795, xref issue gh-32561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36797

Differential Revision: D21257034

Pulled By: ezyang

fbshipit-source-id: d217da0e74f00a433c904defc0bf3eb5f594fd5e
2020-04-27 11:00:23 -07:00
0048243f70 Check compiler -v to determine compiler (fix #33701) (#37293)
Summary:
As described in the issue (https://github.com/pytorch/pytorch/issues/33701), the compiler check for building cpp extensions does not work with ccache. In this case we check `compiler -v` to determine which compiler is actually being used, and check that instead.
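
A minimal sketch of the idea (names are illustrative, not the actual cpp_extension code): ask the possibly ccache-wrapped compiler what it really is via `-v`, instead of trusting the executable name:
```python
import subprocess

def underlying_compiler(compiler: str) -> str:
    # `gcc -v` / `clang -v` print version info (typically to stderr),
    # even when invoked through a ccache wrapper.
    result = subprocess.run([compiler, '-v'], capture_output=True, text=True)
    info = (result.stderr + result.stdout).lower()
    if 'gcc version' in info:
        return 'gcc'
    if 'clang version' in info:
        return 'clang'
    return 'unknown'
```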
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37293

Differential Revision: D21256913

Pulled By: ezyang

fbshipit-source-id: 5483a10cc2dbcff98a7f069ea9dbc0c12b6502dc
2020-04-27 10:49:04 -07:00
6d409481b3 Add overloads of std:: math functions for c10::complex (#35725)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/35284

~This depends on and contains https://github.com/pytorch/pytorch/pull/35524. Please review after the dependency gets merged and I will rebase to get a clean diff.~

The implementation of most functions follow the pattern

```C++
template<typename T>
C10_HOST_DEVICE c10::complex<T> some_function(c10::complex<T> x) {
#if defined(__CUDACC__) || defined(__HIPCC__)
  return static_cast<c10::complex<T>>(thrust::some_function(static_cast<thrust::complex<T>>(x)));
#else
  return static_cast<c10::complex<T>>(std::some_function(static_cast<std::complex<T>>(x)));
#endif
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35725

Differential Revision: D21256854

Pulled By: ezyang

fbshipit-source-id: 2112ba6b79923450feafd7ebdc7184a3eaecadb6
2020-04-27 10:32:16 -07:00
a08a9f3b82 Enable uint8 upsampling 2 (#35029)
Summary:
Hi everyone,

This is a super small PR to enable `uint8` support for `nearest` up-sampling on `cpu` and `cuda`.
This work enables us to move forward with the support of `uint8` images in `torchvision`.
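
A minimal usage sketch (assuming a build with this change):
```python
import torch
import torch.nn.functional as F

img = torch.randint(0, 256, (1, 3, 8, 8), dtype=torch.uint8)
up = F.interpolate(img, scale_factor=2, mode='nearest')  # now works for uint8
print(up.dtype, up.shape)  # torch.uint8 torch.Size([1, 3, 16, 16])
```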

See impacted issues :
https://github.com/pytorch/vision/issues/1375
https://github.com/pytorch/vision/issues/1179#issuecomment-558197607

Note: I wanted to add a unit test to ensure we have the expected behavior. I could not locate the `upsampling` unit tests for `nearest`. I can add the test if you point me to the right location.

Thanks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35029

Reviewed By: cpuhrsch

Differential Revision: D21227144

Pulled By: fmassa

fbshipit-source-id: 33c4b5188dedd8f7f872e9d797e2a9b58ee7315c
2020-04-27 10:25:10 -07:00
5c9d1e4824 Propagate module lints for mobile scripted module. (#37046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37046
ghstack-source-id: 102669259

Creating a python api entry to generate mobile model lints, which takes a scripted module as argument and returns a map of module lints.

The initial version creates a placeholder that includes module bundled input as the first lint instance. More lints will be added in the future.
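
A usage sketch of the entry point (the exact function name is an assumption based on torch.utils.mobile_optimizer):
```python
import torch
from torch.utils.mobile_optimizer import generate_mobile_module_lints  # name assumed

scripted = torch.jit.script(torch.nn.Linear(4, 4))
lints = generate_mobile_module_lints(scripted)
print(lints)  # a list of lint entries, e.g. flagging missing bundled inputs
```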

Test Plan: python test/test_optimizer.py

Reviewed By: dreiss

Differential Revision: D21164648

fbshipit-source-id: 9e8f4e19d74b5464a55cc73b9dc18f358c5947d6
2020-04-27 10:20:12 -07:00
5b9f7f7b0e [cmake] Add USE_SYSTEM_{GLOO,FP16,PTHREADPOOL,PSIMD,FXDIV,BENCHMARK} options (#14699) (#37277)
Summary:
These options are disabled by default, and are supposed to be used by
linux distro developers. With the existing shortcut option
USE_SYSTEM_LIBS toggled, these new options will be enabled as well.

Additionally, when USE_SYSTEM_LIBS is toggled, setup.py should
no longer check for the existence of git submodules.

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37277

Differential Revision: D21256999

Pulled By: ezyang

fbshipit-source-id: 84f97d008db5a5e41a289cb7bce94906de3c52cf
2020-04-27 09:37:27 -07:00
3a0ff3cd2f Generate environment restore script for Windows build jobs (#37319)
Summary:
for better debugging purposes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37319

Differential Revision: D21257011

Pulled By: ezyang

fbshipit-source-id: 41c7f1aa440f3ea626536b64392cca32f7c32dd3
2020-04-27 08:33:12 -07:00
007163407c [cmake] Support "Generic" BLAS (#14699) (#37276)
Summary:
The "Generic" BLAS refers to the Netlib BLAS. This option is meaningful
to the Debian family due to the "update-alternatives" mechanism, which
enables the user to switch the libblas.so providers between different
implementations at runtime, such as ATLAS, OpenBLAS, and Intel MKL.
As such, building against the generic BLAS provides much flexibility.

This new option is not documented in setup.py because it's only supposed
to be used by linux distro (especially Debian family) developers.

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37276

Differential Revision: D21256877

Pulled By: ezyang

fbshipit-source-id: 55a5356653a1cfc763a5699b04afe5938f2007ec
2020-04-27 08:17:43 -07:00
22ac071d9a Add SWA to PyTorch mainline (#35032)
Summary:
This PR is based on the issue https://github.com/pytorch/pytorch/issues/29994#issue-524418771 and the discussion in the previous version of the PR https://github.com/pytorch/pytorch/pull/30559. Specifically, I followed the interface outlined in this [comment](https://github.com/pytorch/pytorch/pull/30559#issuecomment-574864768).

## Structure
- `torch/optim/swa_utils.py` contains the implementation of  `AveragedModel` class, `SWALR` learning rate scheduler and `update_bn` utility
- `test/test_optim.py` contains unit tests for the three components of SWA
- `torch/optim/swa_utils.pyi` describes the interface of `torch/optim/swa_utils.py`

The new implementation consists of
- `AveragedModel` class; this class creates a copy of a given model and allows computing running averages of the parameters.
- `SWALR` learning rate scheduler; after a certain number of epochs it switches to a constant learning rate; this scheduler is supposed to be chained with other schedulers.
- `update_bn` utility; updates the Batch Normalization activation statistics for a given model and dataloader; this utility is meant to be applied to `AveragedModel` instances.

For `update_bn` I simplified the implementation compared to the [original PR](https://github.com/pytorch/pytorch/pull/30559) according to the suggestions by vadimkantorov.

## Example
```python
loader, optimizer, model = ...
swa_model = torch.optim.swa_utils.AveragedModel(model)
# You can use custom averaging functions with the `avg_function` parameter
ema_avg = lambda p_avg, p, n_avg: 0.1 * p_avg + 0.9 * p
ema_model = torch.optim.swa_utils.AveragedModel(model,
                                    avg_function=ema_avg)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                    T_max=300)
swa_start = 160
swa_scheduler = SWALR(optimizer, start_epoch=swa_start, swa_lr=0.05)

for i in range(300):
     for input, target in loader:
         optimizer.zero_grad()
         loss_fn(model(input), target).backward()
         optimizer.step()
         scheduler.step()
         swa_scheduler.step()

     if i > swa_start:
         swa_model.update_parameters(model)

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
```

UPDATED:
```python
loader, optimizer, model, loss_fn = ...
swa_model = torch.optim.swa_utils.AveragedModel(model)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
swa_start = 160
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
for i in range(300):
     for input, target in loader:
         optimizer.zero_grad()
         loss_fn(model(input), target).backward()
         optimizer.step()
     if i > swa_start:
         swa_model.update_parameters(model)
         swa_scheduler.step()
     else:
         scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
```

Fixes https://github.com/pytorch/pytorch/issues/29994
cc soumith vincentqb andrewgordonwilson vadimkantorov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35032

Differential Revision: D21079606

Pulled By: vincentqb

fbshipit-source-id: e07f5e821f72ada63789814c2dcbdc31f0160c37
2020-04-27 07:42:19 -07:00
828d590b06 [ROCm] Update to ROCm 3.3 (#37247)
Summary:
CC ezyang .

ROCm 3.3 packages went live on 2020-04-01.  Tag 376 was pushed on 2020-04-15, so it should be based on ROCm 3.3.

The upgrade to ROCm 3.3 is required as part of the effort to stabilize ROCm CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37247

Differential Revision: D21256198

Pulled By: ezyang

fbshipit-source-id: 92ac21c0122eda360ec279d2c3d462c3e6bf4646
2020-04-27 06:51:13 -07:00
f41742ff2f [autograd] remove spinning for dist engine (#36606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36606

This PR refactors the continuation logic of the async mode on the autograd
engine, to avoid launching spinning work. To achieve that:
1. remove the continuation logic in
execute_graph_task_with_continuiation
2. separate the usage of execute_graph_task between dist_engine and
local engine; now dist_engine universally uses
`execute_graph_task_until_ready_queue_empty` (a better name is appreciated
here).
3. remove enqueue_blocked_task_on_cpu
4. remove the async mode in `execute_with_graph_task` as we don't need
to use it in dist_engine

Test Plan: Imported from OSS

Differential Revision: D21032731

Pulled By: wanchaol

fbshipit-source-id: 708ea3bc14815bdc151b56afa15eb85b4ac0f4b1
2020-04-26 22:23:30 -07:00
ed9ec3c96f [autograd] refactor some functions (#37061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37061

This PR refactors:
1. `set_device`, to move it out of Engine
2. put `graph_task_completed` into GraphTask
3. put `mark_graph_task_completed` into GraphTask

This also makes it easy for the distributed engine to call those functions.

Test Plan: Imported from OSS

Differential Revision: D21188688

Pulled By: wanchaol

fbshipit-source-id: f56106e6ed7d966cfa4d962781c7865cc3c5321d
2020-04-26 22:21:59 -07:00
47fec01c45 Fix cpp extension compile failure on some envs (#37221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37221

Test Plan: Imported from OSS

Differential Revision: D21226873

Pulled By: glaringlee

fbshipit-source-id: 0a390bbeaf153ee5ec355943f92c2dbcc5e04b59
2020-04-26 11:00:20 -07:00
b428f454e1 Revert D18927220: if_constexpr for C++14
Test Plan: revert-hammer

Differential Revision:
D18927220

Original commit changeset: 19a135e00af6

fbshipit-source-id: a1b8755a27903b98b742881b3ecce4f5e99543b2
2020-04-26 04:27:53 -07:00
b64fc3c4b5 Changes warnings generated in cpp to show point of Python origination (#36052)
Summary:
Today in PyTorch, warnings triggered in C++ are printed to Python users like this:

`../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.`

This may be unhelpful to Python users, who have complained that it's difficult to relate these messages back to their programs. After this PR, warnings that go through the PyWarningHandler (and allow it to add context) print like this:

```
test/test_torch.py:16463: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead. (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:81.)
  cpu_result = getattr(cpu_tensor, op_str)(*cpu_args)
```

This relates the warning back to the user's program. The information about the cpp file and line number is preserved in the body of the warning message.

Some warnings, like those generated in the JIT, already account for a user's Python context, and so they specify that they should be printed verbatim and are unaffected by this change. Warnings originating in Python and warnings that go through c10's warning handler, which prints to cerr, are also unaffected.

A test is added to test_torch.py for this behavior. The test relies on uint8 indexing being deprecated and its warning originating from its current header file, which is an unfortunate dependency. We could implement a `torch.warn` function instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36052

Differential Revision: D20887740

Pulled By: mruberry

fbshipit-source-id: d3515c6658a387acb7fccaf83f23dbb452f02847
2020-04-25 21:18:58 -07:00
f8ec51bd86 Ensure DataParallel replicas can be saved (#37307)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37182

The `zero_grad` wrapper from `_replicate_for_data_parallel` can't be pickled. So instead, I set an attribute `_is_replica = True` and check for this in `Module.zero_grad`.
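
A minimal sketch of the guard described above (the warning text and surrounding code are assumptions; the real change lives in `Module.zero_grad`):
```python
import warnings

def zero_grad(self):
    # Replicas are flagged with a plain (picklable) attribute instead of a
    # wrapped zero_grad function, so saving the replica works.
    if getattr(self, '_is_replica', False):
        warnings.warn(
            "Calling .zero_grad() on a DataParallel replica has no effect; "
            "gradients accumulate on the original module.")  # message assumed
        return
    for p in self.parameters():
        if p.grad is not None:
            p.grad.detach_()
            p.grad.zero_()
```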
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37307

Differential Revision: D21246119

Pulled By: mrshenli

fbshipit-source-id: 4755786d48a20bc247570ba672de9dd526914ce1
2020-04-25 20:57:24 -07:00
2b050371b4 Make listenLoopInternal non-virtual (#37265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37265

In PGA, `listenLoopInternal` should not be virtual - PGA doesn't have any child classes that override this. Re-arranged some comments for `listenLoop` as well.
ghstack-source-id: 102880792

Test Plan: Sandcastle/CI

Differential Revision: D21238761

fbshipit-source-id: 5ec5058bc462182cf970faca9a734c11c7be2a32
2020-04-25 20:14:04 -07:00
d98ea604f4 Improve Error Message for Dist Autograd Context Cleanup Failure (#37255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37255

Improved the error message logged when Distributed Autograd Context cleanup fails - added node information and the underlying error. The previous error message also assumed that the cause of the error was too many RPCs failing, but this is not necessarily the case.
ghstack-source-id: 102867620

Test Plan: Ensuring Sandcastle/CI tests pass. Verified the correct message is logged when this code path is executed in `test_backward_node_failure` and `test_backward_node_failure_python_udf` .

Differential Revision: D20950664

fbshipit-source-id: 267318187b7ef386930753c9679a5dfab6d87018
2020-04-25 19:25:07 -07:00
b198796a28 [quant] quantized reflection_pad1d (#36450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36450

Test Plan: Imported from OSS

Differential Revision: D20984967

Pulled By: z-a-f

fbshipit-source-id: 4731f16ba05a6aa57636d9ab85f12dfdeebcf08d
2020-04-25 18:21:45 -07:00
7604f470ed Add weight info in debug_ssa_net (#37262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37262

It's convenient to have weight info in the debug_ssa_net so that we can tell what is a weight and what is a primary input. We can then get their shape and size info easily with a post-processing script.

Reviewed By: ChunliF

Differential Revision: D21237537

fbshipit-source-id: 1fadc605283ef2eed78c44494e062a16ccf135ab
2020-04-25 18:07:28 -07:00
92e91cee8d ONNX Export Support for CrossEntropyLoss (#34830)
Summary:
Add ONNX export support for torch.nn.CrossEntropyLoss.

This PR makes following changes:
1. Updates nll_loss export
2. Makes a post pass for SoftmaxCrossEntropy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34830

Reviewed By: hl475

Differential Revision: D21230712

Pulled By: houseroad

fbshipit-source-id: c81911a41968e23813ba10274340ce4d8ba1ed78
2020-04-25 17:56:53 -07:00
205c6ffbc5 [quant] Generalizing _calculate_dynamic_qparams in quantized test (#36449)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36449

Test Plan: Imported from OSS

Differential Revision: D20984966

Pulled By: z-a-f

fbshipit-source-id: 17437297adae813bc5c6fa43c6c7514f72ce2f6c
2020-04-25 17:06:40 -07:00
ca39f99d48 [Pytorch Numeric Suite] Add module level comparison (#37242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37242

Add module level comparison API.
ghstack-source-id: 102853727

Test Plan: buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub'

Reviewed By: raghuramank100

Differential Revision: D21232277

fbshipit-source-id: de707eea101a66a37869129460274c56e4e07db2
2020-04-25 16:46:10 -07:00
a04022c656 Use std::chrono::high_resolution_clock for profiling on Mac (#37280)
Summary:
According to the Darwin man page:
    `CLOCK_REALTIME` is the system's real time (i.e. wall time) clock, expressed as the amount of time since the Epoch.  This is the same as the value returned by `gettimeofday`(2).

I.e. it returns timestamps with only microsecond resolution, as can be observed by running the following small program:
```
#include <sys/time.h>
#include <time.h> /* clock_gettime, struct timespec */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

bool conseq_time(clockid_t c) {
  struct timespec t1, t2;
  clock_gettime(c, &t1);
  clock_gettime(c, &t2);
  printf("t1={.tv_sec=%ld, .tv_nsec=%ld}\n", t1.tv_sec, t1.tv_nsec);
  printf("t2={.tv_sec=%ld, .tv_nsec=%ld}\n", t2.tv_sec, t2.tv_nsec);
  bool rc = t1.tv_sec == t2.tv_sec && t1.tv_nsec == t2.tv_nsec;
  printf("Two timestamps are %sequal\n", rc ? "" : "not ");
  return rc;
}

int main(void) {
  printf("using CLOCK_REALTIME\n");
  conseq_time(CLOCK_REALTIME);
  printf("using CLOCK_MONOTONIC_RAW\n");
  conseq_time(CLOCK_MONOTONIC_RAW);
  return 0;
}
```
which if compiled outputs something like:
```
using CLOCK_REALTIME
t1={.tv_sec=107519, .tv_nsec=860315000}
t2={.tv_sec=107519, .tv_nsec=860315000}
Two timestamps are equal
using CLOCK_MONOTONIC_RAW
t1={.tv_sec=107520, .tv_nsec=954297363}
t2={.tv_sec=107520, .tv_nsec=954297426}
Two timestamps are not equal
```

But why do this at all, if all this platform-specific logic is already nicely abstracted in `std::chrono`:
https://github.com/llvm/llvm-project/blob/master/libcxx/src/chrono.cpp#L117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37280

Differential Revision: D21246608

Pulled By: malfet

fbshipit-source-id: 6beada30657a2720000e34214b1348112e55be50
2020-04-25 15:57:08 -07:00
59052e39b8 [quant] qtensor resize (#36442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36442

Test Plan: Imported from OSS

Differential Revision: D20984080

Pulled By: z-a-f

fbshipit-source-id: 7fcf24bd2f92f038b670f510118b012d8c7acc74
2020-04-25 15:52:35 -07:00
bf860a4eba Adds missing documentation . (#37295)
Summary:
Fixes torch.isclose documentation missing a `.`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37295

Differential Revision: D21245426

Pulled By: mruberry

fbshipit-source-id: 88ce57ed68c2eac6aa83932780a6ba30e9fa69ea
2020-04-25 15:36:35 -07:00
34284c1279 Fix NaN error in dynamic quantization in qLinear, re-enable test_quantized_rnn (#36009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36009

When the scale is very small (less than float eps, but greater than the minimum double-precision value), computing the reciprocal of the scale in float precision within FBGEMM returns inf, while QuantUtils does not. Changed the computation in QuantUtils to also occur in float precision, to re-enable the tests.
ghstack-source-id: 102896302
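
The precision issue can be demonstrated in a few lines (illustration only, not the actual FBGEMM/QuantUtils code):
```python
import numpy as np

scale = 1e-42  # representable in float32 (as a subnormal) and in float64
# 1 / 1e-42 = 1e42 overflows float32's max (~3.4e38) -> inf
print(np.float32(1.0) / np.float32(scale))  # inf (with an overflow RuntimeWarning)
print(np.float64(1.0) / np.float64(scale))  # 1e+42, finite in double precision
```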

Test Plan:
buck test caffe2/test:quantization -- 'test_quantized_rnn \(quantization\.test_quantization\.PostTrainingDynamicQuantTest\)' --print-passing-details --run-disabled
Summary (total time 59.91s):
  PASS: 1
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0

Differential Revision: D20853000

fbshipit-source-id: 948a888f5516b3ba9c6efb7de31ef2cc9d431991
2020-04-25 14:52:56 -07:00
84a31fb4e7 Revert D18927221: Boxing uses if_constexpr instead of SFINAE
Test Plan: revert-hammer

Differential Revision:
D18927221

Original commit changeset: 70d99025b45e

fbshipit-source-id: a4b650bbb6d76dda6086d88eb554f3c3077b0f76
2020-04-25 14:22:41 -07:00
c90955e3d1 [profiler] Sort by end interval as well when parsing CPU trace (#37297)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37297

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D21245463

Pulled By: jamesr66a

fbshipit-source-id: 8d307eaa32fa960b93dfd9a3b0b4c767fd903094
2020-04-25 13:58:30 -07:00
ea741f829e Add --repeat option to python unit-test (#37281)
Summary:
This runs the same test suite (or an individual test) multiple times.
Useful for detecting flaky tests.

Example usage: `python test_autograd.py TestAutograd.test_profiler -v --repeat=100`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37281

Differential Revision: D21244442

Pulled By: malfet

fbshipit-source-id: 3ecafec7ae87bc1e418aa28151bbc472ef37a713
2020-04-25 13:56:58 -07:00
44345ad08c Do not define C10_IOS on Mac (#37283)
Summary:
Because MacOS is not iOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37283

Test Plan: CI

Differential Revision: D21244398

Pulled By: malfet

fbshipit-source-id: b822e216e83887e2f2961b5c5384eaf749629f61
2020-04-25 13:52:46 -07:00
cb27067b32 [ONNX] Remove inverse op (#37005)
Summary:
ONNX inverse op is being removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37005

Reviewed By: hl475

Differential Revision: D21230728

Pulled By: houseroad

fbshipit-source-id: 7e10414918c57938cda4ca03875c070319d429fb
2020-04-25 12:23:15 -07:00
b18f57e548 Boxing uses if_constexpr instead of SFINAE (#31092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31092

-
ghstack-source-id: 102878439

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D18927221

fbshipit-source-id: 70d99025b45edfaef11a0d587cf8bf8e749df6b8
2020-04-25 11:34:04 -07:00
f5e6f1f333 if_constexpr for C++14 (#31091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31091

This implements a C++17 "if constexpr" like feature for C++14.
This can be used, for example, to replace SFINAE or to force the compiler to remove some parts of a function in the assembly based on a condition.
PRs stacked on top will use this to simplify some of our template metaprogramming.
ghstack-source-id: 102867141

Test Plan: unit tests

Differential Revision: D18927220

fbshipit-source-id: 19a135e00af6ebb0139ce3730353762d4512158f
2020-04-25 11:31:51 -07:00
04b36fc264 [TensorExpr] rfactor implementation (#36237)
Summary:
A similar interface to Halide's rfactor: https://halide-lang.org/tutorials/tutorial_lesson_18_parallel_associative_reductions.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36237

Reviewed By: zheng-xq

Differential Revision: D21233309

Pulled By: bwasti

fbshipit-source-id: d2706a9e90b707ee195e339f834ff4a54b63a256
2020-04-25 10:01:31 -07:00
c52deb694e Consolidate usage on torch::jit::toPyObject in RPC request_callback (#37249)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37249

Test Plan: Imported from OSS

Differential Revision: D21234990

Pulled By: mrshenli

fbshipit-source-id: d07210151342bd2ad12d1364d9f22817ee59b0c2
2020-04-25 09:37:38 -07:00
3d934c3d36 Add using torch::utils::Future to simplify code in RRefContext (#36811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36811

Test Plan: Imported from OSS

Differential Revision: D21093846

Pulled By: mrshenli

fbshipit-source-id: 61a6b1483ef1533803a18bec216ebe82aa187458
2020-04-25 09:37:33 -07:00
269ec9a139 Prevent RRef.to_here() to block an RPC thread on the callee using Future callbacks (#36805)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36805

Test Plan: Imported from OSS

Differential Revision: D21093847

Pulled By: mrshenli

fbshipit-source-id: 81b0934874af36e03329fe6176628e3aca12811f
2020-04-25 09:37:28 -07:00
6e1e55c134 Prevent RRef unpickle to block waiting for OwnerRRef creation (#36785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36785

Currently, RRef unpickle (both Python and TorchScript) will block
until the OwnerRRef has been created by the original `rpc.remote`
call, if it is an OwnerRRef. This is not ideal, as correctness
would then depend on the thread-count configuration. This
commit changed that behavior. Both `rpc.remote` and the unpickle
can create OwnerRRefs. More specifically, whichever one arrives
first will create the OwnerRRef and the subsequent ones will
retrieve the same OwnerRRef, so that no one is blocking.

Test Plan: Imported from OSS

Differential Revision: D21083089

Pulled By: mrshenli

fbshipit-source-id: 34ef063d50549b01c968b47815c4fe9fac179d3d
2020-04-25 09:36:02 -07:00
d7f7c290e3 addmv migration [resubmit] (#37236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37236

Differential Revision: D21232988

Pulled By: anjali411

fbshipit-source-id: ac6c0ee018aef3c841b039d76e6e1fbb3cd0292d
2020-04-25 07:43:27 -07:00
856e8cf028 Revert D21213786: Enable global observers API
Test Plan: revert-hammer

Differential Revision:
D21213786

Original commit changeset: e618254da74a

fbshipit-source-id: 425ea5d44fa55655ec0dd586c5075996b926177b
2020-04-25 00:59:24 -07:00
e6231c9e24 Do not run valgrind on the Aten unit tests compiled with clang (#37152)
Summary:
Valgrind detects some uninitialized variables if torch_cpu is compiled with clang, which are not reproducible if the same code is compiled with gcc or checked with the address sanitizer tool
See https://github.com/pytorch/pytorch/issues/37117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37152

Differential Revision: D21241577

Pulled By: malfet

fbshipit-source-id: 4a5dddf2a4fc4238dc9117cb92ee4e34af9e6064
2020-04-25 00:11:28 -07:00
6e659e928b Enable global observers API (#37195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37195

After adding c10::DispatchKey::Profiler the behavior of RecordFunction
observers is also controlled by the dispatch key,
this PR moves the logic outside of the profiler into the record function

Reviewed By: ngimel

Differential Revision: D21213786

fbshipit-source-id: e618254da74a4f1ce16c51a3869bbd75a4f561ad
2020-04-24 23:49:28 -07:00
4e976b9334 Remove callBoxedWorkaround (#36850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36850

Since now all unboxing happens after dispatch, which means that all c10 ops support unboxing, we can now use op.callBoxed() for all ops and don't need callBoxedWorkaround (which was going through the JIT registry) anymore.
ghstack-source-id: 102879558

Test Plan: waitforsandcastle

Differential Revision: D21102375

fbshipit-source-id: d1e041116563a9650d5a86b07eb96d217d8756f3
2020-04-24 23:13:31 -07:00
6ea2aedab9 Cast shape_.size() to int64_t before comparing with squash_dim (#37109)
Summary:
This is generating a considerable amount of warning messages since TensorIterator.h is included from a lot of files:

    /home/hong/xusrc/pytorch/aten/src/ATen/native/TensorIterator.h:372:47:
    warning: comparison of integers of different signs: 'const int64_t' (aka 'const long') and 'c10::SmallVectorTemplateCommon::size_type' (aka 'unsigned long') [-Wsign-compare]
        TORCH_CHECK(squash_dim >= 0 && squash_dim < shape_.size(),
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37109

Differential Revision: D21242163

Pulled By: ngimel

fbshipit-source-id: aec2978ee76750676a449eb6671142a782658de3
2020-04-24 22:39:53 -07:00
30eb0bdf32 Do not define list "0" in torch/CMakeLists.txt (#37275)
Summary:
Per https://cmake.org/cmake/help/latest/command/list.html the list-insert argument order is
`list(INSERT <list> <index> [<element>...])`

That is, the first argument is the list name, not the index at which elements are inserted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37275

Differential Revision: D21243539

Pulled By: malfet

fbshipit-source-id: b947ad64f1a3549df68083383537899b19abd9ca
2020-04-24 21:32:13 -07:00
904949382e Ensure that histogram observers have zero-point of zero for post ReLU activations (#37107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37107

Currently histogram observers relax both the min and max values of the activations for performance reasons. This causes an issue for Glow, where there is a slowdown if the zero-point is not zero for post-ReLU activations.
ghstack-source-id: 102768017

Test Plan: buck test caffe2/test:quantization -- 'test_histogram_observer_one_sided \(quantization\.test_quantization\.RecordHistogramObserverTest\)' --print-passing-details

Differential Revision: D21187636

fbshipit-source-id: 8d616b9e9caf2979a26a215e99434f71025e3d8b
2020-04-24 20:57:34 -07:00
ef9ec03e77 [CUDA11] Pytorch change (#37187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37187

Adding CUDACC guard for gcc9+

Reviewed By: ngimel

Differential Revision: D21209798

fbshipit-source-id: 5cc4efc7108577d74bee4c12c942ed1e5bf9bbac
2020-04-24 20:29:53 -07:00
a80a438e37 correctly set and restore states in te tests (#37210)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37210

Differential Revision: D21238634

Pulled By: Krovatkin

fbshipit-source-id: 6462239753399c10c871baa5d5fdff5465cf2544
2020-04-24 20:16:51 -07:00
686b521784 Update cusparse deprecated Xcsrmm2 call (#37202)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/36845 due to Windows CI failure.

binary_windows_wheel_3_7_cu102_build is passed, so the windows guard should be fine this time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37202

Differential Revision: D21233358

Pulled By: xw285cornell

fbshipit-source-id: 707de0ff21d178686354ffaea7625f1d68b3e8d3
2020-04-24 20:12:21 -07:00
4a72ddedcd Show cpu info for macos jobs (#37220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37220

Differential Revision: D21243205

Pulled By: ezyang

fbshipit-source-id: 77a4d904e80c59b6d4d39b1a1a0fb441d8a35f0c
2020-04-24 19:53:34 -07:00
1d0334dd62 Add cpu build and test to Windows CI (#37135)
Summary:
Add Windows build and test for CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37135

Differential Revision: D21243189

Pulled By: ezyang

fbshipit-source-id: dd804ac258940e608facaf375d80ff5a0c59a7ae
2020-04-24 19:49:07 -07:00
1d8012a624 Delete dead code (#37254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37254

This code is leftover from the KernelFactory deletion.
ghstack-source-id: 102866045

Test Plan: waitforsandcastle

Differential Revision: D21235480

fbshipit-source-id: 739ba677d2139ba9934d103f75a609638f1a3856
2020-04-24 18:08:31 -07:00
1f08ff12ec [jit] fix named tuples as attributes (#37251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37251

This was broken by recent changes to how we serialize with type tags. We
saved a name (like `Dict[str, MyNamedTuple]`) and then relied on the
mobile type parser to resolve that name back into a set of types.

This doesn't work for any NamedTypes, as the mobile type parser doesn't
know how to resolve those. The unpickler allows the caller to inject a
type resolver for this purpose; use that so that when importing in a
non-mobile environment you get the right results.

A second problem also had to be fixed: the SourceImporter type loader
would only load named types directly (e.g. `MyNamedTuple`) and choked if
it was given a general type that contained a named tuple (e.g.
`List[MyNamedTuple]`). Fixed that and renamed `loadNamedType` to
`loadType` for clarity.

Test Plan: Imported from OSS

Differential Revision: D21235213

Pulled By: suo

fbshipit-source-id: 16db0f4c5e91a890d67a8687cc8ababa6b94b0f4
2020-04-24 17:48:44 -07:00
47c4dca1ab Remove python-2 or python<3.5 checks from unit tests (#37252)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37252

Test Plan: CI

Differential Revision: D21241083

Pulled By: malfet

fbshipit-source-id: 44164b822f7905288abb2beda0175d2162d86143
2020-04-24 17:42:04 -07:00
521910e0e9 Update clang_format_ci.sh (#37268)
Summary:
shellcheck led me astray!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37268

Differential Revision: D21241361

Pulled By: suo

fbshipit-source-id: 68244bb889e784ccd36d714209c2c15e2d6f04f8
2020-04-24 17:19:36 -07:00
b60c3dfdd9 Add fallback wrapper for profiler (#37194)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37194

Test Plan: Imported from OSS

Reviewed By: ilia-cher, ngimel

Differential Revision: D21217886

Pulled By: jamesr66a

fbshipit-source-id: b06195e9ac110979d128391e067d5c9f416c1873
2020-04-24 16:24:58 -07:00
047488a7ff Mask all high dispatch keys in BackendSelect kernels (#37257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37257

Previously, we were relying on fragile invariants to avoid collecting
and feeding high precedence, non-backend dispatch keys to backend
initialization machinery, which would assert on them. (These same
keys are then used for redispatch, so a second latent problem lurks
behind the first.) Here we mask off the BackendDispatch key and all
keys to its left.

Followup: move backend init code to backend-specific wrappers
(`CPUType` etc.). This will let us remove the backend init code from
both BackendSelect and STATIC_DISPATCH wrappers. (Though BackendSelect
will still need to compute a dispatch key, so the logic introduced
here will still be necessary.)

Test Plan: Imported from OSS

Differential Revision: D21235856

Pulled By: bhosmer

fbshipit-source-id: 1b8bd7897ed4b41a95718f3cfceddf4ee094744a
2020-04-24 16:12:39 -07:00
b6bb644e41 Fix long line splitting issue in python_print (#37088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37088

For an inlined expression tree like `(e_0, (e_1, e_long))` the previous
algorithm only scanned the same statement as `e_long`, splitting the
inlined expressions across lines. Because it did not scan `e_0`, `e_0`
would still get emitted inline, causing it to reverse order with `e_1` and
`e_long`. The new algorithm scans starting at `e_long` and going all
the way back up the expression until it reaches the end of the inlined
statement. Caching of what has already been scanned has been added so that
if there was a second long expression `e_long2` after `e_long`, it would not
rescan and re-inline the statements that were already split.

Test Plan: Imported from OSS

Differential Revision: D21180394

Pulled By: zdevito

fbshipit-source-id: 4d142c83a04c89a47d04282f67a513f82cf153c0
2020-04-24 15:14:39 -07:00
d6ce6570f9 Remove unused imports in aten/src/ATen/function_wrapper.py (#37245)
Summary:
typing has been available since Python 3.5, so there is no need to try-import it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37245

Differential Revision: D21236650

Pulled By: albanD

fbshipit-source-id: daf150103835d0c6cd3c39300044e548bb6d311d
2020-04-24 15:10:36 -07:00
4f3946a89b Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#37193)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly, e.g.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR

Added CPU acc types for complex.
Added cumsum, cumprod for complex dtypes.

Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes

Old PR - https://github.com/pytorch/pytorch/pull/36747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37193

Differential Revision: D21229373

Pulled By: anjali411

fbshipit-source-id: 8a086136d8c10dabe62358d276331e3f22bb2342
2020-04-24 15:05:50 -07:00
c38dcd45d7 [jit] fix return different types bug in tracing module calls (#37190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37190

If a module call returns different types, we need to record them correctly.

Test Plan: Imported from OSS

Differential Revision: D21214871

Pulled By: wanchaol

fbshipit-source-id: 46ba98f08ed4ade22f9740cb3fca84b29557e125
2020-04-24 14:49:28 -07:00
5362a0b948 [jit] fix lifting bug in tracing module calls (#37189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37189

This fixes a bug in tracing module calls, so that values are lifted with their
corresponding value type rather than the default tensor type.

Test Plan: Imported from OSS

Differential Revision: D21214872

Pulled By: wanchaol

fbshipit-source-id: f635154851365e2d7b88186d6e47634123eac42f
2020-04-24 14:47:54 -07:00
a13b5b0ae8 Split reduction compile units (#37205)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37205

Test Plan: Imported from OSS

Differential Revision: D21233254

Pulled By: ngimel

fbshipit-source-id: 68b37ebbdd715a30c616e425a39b6b21c01b37e2
2020-04-24 14:19:31 -07:00
9f02897431 Account for the change in optimizeForMobile API change.
Test Plan: TBD

Reviewed By: ayush29feb

Differential Revision: D21185736

fbshipit-source-id: fc7abc9c2eba8e6a390e54168b1fc4a17bf80e68
2020-04-24 13:21:56 -07:00
2baff9476e Test test_is_nonzero make expected exception inline (#37128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37128

In certain build modes (in fbcode, building a .par) the mechanism to get test output "expect" files doesn't work.
All other tests in test_torch.py already had assertExpectedInline instead of assertExpected, with the expected result inline in the file.
There was no equivalent for assertExpectedRaises, so I added one and changed the tests for test_is_nonzero (the only test using this).
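
A sketch of the inline style, written inside a TestCase subclass (helper name and message follow the description above; treat exact details as assumptions):
```python
def test_is_nonzero(self):
    self.assertExpectedRaisesInline(
        RuntimeError,
        lambda: torch.tensor([]).is_nonzero(),
        "Boolean value of Tensor with no values is ambiguous",
    )
```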

Test Plan: CI, specifically the test test_is_nonzero should pass

Reviewed By: malfet

Differential Revision: D21197651

fbshipit-source-id: 2a07079efdcf1f0b0abe60e92cadcf55d81d4b13
2020-04-24 13:12:31 -07:00
deefafb01d Allow std::array as operator argument and return (#34399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34399

Custom ops can now take std::array as arguments and return it.
This PR also moves the ops in native_functions.yaml that were blocked by this to now `use_c10_dispatcher: full`.
ghstack-source-id: 102643208

Test Plan: unit tests

Differential Revision: D20315072

fbshipit-source-id: 93232448663df962f65e0f25bfb35826dd3374f8
2020-04-24 13:07:14 -07:00
fc528ccbaf [wip] Allow ArrayRef as kernel parameter (#34335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34335

ghstack-source-id: 102625264

Test Plan: waitforsandcastle

Differential Revision: D20296841

fbshipit-source-id: 123e6eae304dbca17d8f7474a79fb3b4769d23ad
2020-04-24 13:05:49 -07:00
93cd05b0f4 Fix CMake errors on systems where {Q/X}NNPACK is not supported (#35607)
Summary:
- add a couple of checks for USE_XNNPACK to disable additional
  code paths if XNNPACK is not supported

When passing through the code paths where the platform checks
are made (cmake/Dependencies.cmake:89), if XNNPACK is not
supported, then the var FXDIV_SOURCE_DIR will not be
set. CMake emits errors when add_subdirectory is called and
FXDIV_SOURCE_DIR is empty.

see: https://github.com/pytorch/pytorch/issues/34606
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35607

Differential Revision: D20895645

Pulled By: seemethere

fbshipit-source-id: 3bd10cf89f0fb6825fdd6e1d52c71ee37c67b953
2020-04-24 12:37:23 -07:00
6e92579883 Added autograd support for C->C functions and enabled requires_grad=True for complex (#36932)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36932

Differential Revision: D21181230

Pulled By: anjali411

fbshipit-source-id: 295f2cd1e2b9918a8b2cb88cab0536b2407dc455
2020-04-24 12:30:49 -07:00
1beca4ac6a Prerequisites for CSPRNG (#36631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36631

Summary of changes

1. Moved random transformation functions to DistributionHelper.h (`uniform_int_from_to_distribution`, `uniform_int_full_range_distribution`, `uniform_int_distribution`) to avoid code duplication between default CPU, CUDA rngs and custom rng extensions
2. Made GeneratorImpl fields protected instead of private
3. Introduced `TORCH_CHECK_IF_NOT_ON_CUDA`, which does the same as `TORCH_CHECK` if it is not a CUDA/ROCm device
4. To test multiple rng extensions I had to move ops registration to the method `registerOps()`, expose it to python, and call it in `def setUp(self)`

Test Plan: Imported from OSS

Differential Revision: D21229202

Pulled By: pbelevich

fbshipit-source-id: 6aa3280f2fc3324cf3e748388b5087e3a1e49f23
2020-04-24 12:25:37 -07:00
af08334c63 better local command for clang-format check (#37127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37127

Wrap what we're running in CI in a small script so we can exactly reproduce it locally if necessary.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D21196804

Pulled By: suo

fbshipit-source-id: 45497daae4bafd236a0d1bb1480841f0d9f39262
2020-04-24 12:19:57 -07:00
5a27ec09b8 Add Inverse Short Time Fourier Transform in ATen native (#35569)
Summary:
Ported `torchaudio`'s implementation (test, and documentation as well) to ATen.

Note
 - Batch packing/unpacking is performed in Python. The ATen implementation expects a 4D input tensor.
 - `hop_length` is initialized in the same way as in the `stft` implementation. [Torchaudio's version tried to mimic the same behavior but is slightly different](7da61a4bee/torchaudio/functional.py (L152-L157)).
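
As a usage sketch, a round trip through `stft`/`istft` (written against a recent PyTorch API; `return_complex` did not exist at the time of this PR):
```python
import torch

x = torch.randn(8000)
spec = torch.stft(x, n_fft=400, hop_length=100, return_complex=True)
recon = torch.istft(spec, n_fft=400, hop_length=100, length=x.numel())
print((x - recon).abs().max())  # tiny: near-perfect reconstruction
```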

Closes https://github.com/pytorch/pytorch/issues/34827
Relates https://github.com/pytorch/pytorch/issues/3775
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35569

Differential Revision: D21178090

Pulled By: mthrok

fbshipit-source-id: 2701a8b241a36a6fb1b740c2fb2b07cb938185d4
2020-04-24 12:14:55 -07:00
20328f67bb Add core of c10::complex [resubmit] (#36626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36626

This reverts commit 9216c67c9eb0dfc58b530740d954997130e05b13.

Test Plan: Imported from OSS

Differential Revision: D21140441

Pulled By: anjali411

fbshipit-source-id: 488530088e2ff87dc27e70d21ace88ff2967e7ab
2020-04-24 12:08:23 -07:00
6ac0f67699 [C2] Optimize MulGradient Operator when inner_size is 1 (#36767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36767

Add a simpler implementation of the MulGradient cuda kernel for when inner_size == 1; the inner loop is eliminated.

Reviewed By: xw285cornell

Differential Revision: D21013269

fbshipit-source-id: bb62470d91a7fef6eecc3d4766a2c994ca6bb2c8
2020-04-24 11:17:49 -07:00
cae77fa351 [doc] Fix broken links in the TOC of CONTRIBUTING.md (#37131)
Summary:
Some links in the TOC of CONTRIBUTING.md are broken, since GitHub removes invalid characters (e.g., `+` in C++) in the anchor link, while the existing TOC uses `-` as replacement.

This PR uses `-` instead of `*` and `+` for the bullet lists to make it consistent with README.md.
b889e0da8a/README.md (L11-L18)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37131

Differential Revision: D21231299

Pulled By: zou3519

fbshipit-source-id: 8e7bb61550827ce97378d3428542e43612bac8e1
2020-04-24 10:56:10 -07:00
385165ec67 [reland][quant] QuantizedCUDA implementation (#36936) (#37081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37081

Closes https://github.com/pytorch/pytorch/issues/30813

Relanding of https://github.com/pytorch/pytorch/pull/35463

1. Tensor quantization logic (quantize_*) is moved to aten/native/quantized. Previously all logic for tensor quantization lived in the aten/quantized/Quantizer.cpp file, and it started to become complicated and hard to read. This problem should be addressed in a refactoring PR. Still, I reworked this partially because I had to add tensor quantization logic for CUDA, and it was natural to move everything to aten/native/quantized.
2. Requirements to run CUDA_tensor_apply* were eased to process any tensor that lives on the CUDA device (QuantizedCUDA included).
3. All quantized data types now have a default constructor. NVCC refuses to compile any gpu_kernel or CUDA_tensor_apply* without them.
4. Minor changes in many files to register QuantizedCUDA backend.
5. test_quantized_tensor is extended to process QuantizedCUDA backend where possible.

Test Plan: Imported from OSS

Differential Revision: D21206694

Pulled By: jerryzh168

fbshipit-source-id: c7433aad9c095a34c57e6dddd128b5c5d9292373
2020-04-24 10:21:59 -07:00
77abb6938e Port register_string_ops.cpp to new operator registration API (#37008)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/36925

We have a new operator registration API introduced in https://github.com/pytorch/pytorch/issues/36258, and we need to port all use sites of the old registration API to use it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37008

Differential Revision: D21160557

Pulled By: jessebrizzi

fbshipit-source-id: 6bc0d57c40229cc7a477cde371c08479d4a4fe4f
2020-04-24 10:04:56 -07:00
8254a63802 Speed up calculate Qparams for per-channel observers (#30485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30485

Use vectorization to speed up calculating Qparams for per-channel observers. The new implementation is about 1000 times faster.

Task:
https://github.com/pytorch/pytorch/issues/30348#event-2824868602
ghstack-source-id: 102808561

Test Plan:
```
import torch
import time
import numpy as np
from torch.quantization.observer import PerChannelMinMaxObserver

obs = PerChannelMinMaxObserver()
acc_time = 0
X = torch.randn(1000, 10)
obs(X)
for i in range(100):
    start = time.time()
    obs.calculate_qparams()
    acc_time = acc_time + time.time()-start
print(acc_time)
```
Before change:
20.3

After change:
0.017

Differential Revision: D18711905

fbshipit-source-id: 3ed20a6734c9950773350957aaf0fc5d14827640
2020-04-24 07:32:36 -07:00
a50a1fb4c3 Enforce kw-only args now that py2 is unsupported (#37069)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37069

Test Plan: Imported from OSS

Differential Revision: D21204729

Pulled By: nairbv

fbshipit-source-id: 8e93decae59e753706fa288bcdc3bf6278b8eeb5
2020-04-24 07:08:24 -07:00
35b9c89dc1 Revert D21045393: [PyTorch Numeric Suite] Add module level comparison
Test Plan: revert-hammer

Differential Revision:
D21045393

Original commit changeset: 4303805f732c

fbshipit-source-id: 06d8a234eda800eb14bc3aa58ff14b0d3cf86d86
2020-04-24 07:03:04 -07:00
fba9b9a023 [PyTorch Numeric Suite] Add module level comparison (#36669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36669

Add module level comparison API.
ghstack-source-id: 102802362

Test Plan: buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub'

Differential Revision: D21045393

fbshipit-source-id: 4303805f732cc8c8fc67ce40d9594b664507bf82
2020-04-24 00:17:22 -07:00
827f04a075 Support creating an RPC gang of size 1 (#32731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32731

As we now support send to self, we no longer require world_size > 1.
Removing the assert from ProcessGroupAgent.

Test Plan: Imported from OSS

Differential Revision: D19609558

Pulled By: mrshenli

fbshipit-source-id: ecec18d756f97d8d78d4526a63b7cb8ab6f858a3
2020-04-23 22:53:13 -07:00
a633c2d112 Fix const-cast lint error in process_group_agent.cpp (#37184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37184

Test Plan: Imported from OSS

Differential Revision: D21213292

Pulled By: mrshenli

fbshipit-source-id: 1fb5447bc0033242e97cc2ec8c52b24e0cf61e8a
2020-04-23 22:52:03 -07:00
3ff892febb Remove redundant definition of fmadd functions in complex Vec256 (#37167)
Summary:
These straightforward implementations have already been covered in
vec256_base.h. No need to specialize the template function for complex
types.

7aec364bdf/aten/src/ATen/cpu/vec256/vec256_base.h (L696-L698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37167

Differential Revision: D21222595

Pulled By: ezyang

fbshipit-source-id: fbe311cd39cafbb6ccec151f9d8e52450b17437f
2020-04-23 21:41:36 -07:00
070dea2d7e Updating submodules
Summary:
GitHub commits:

23198e8b72
4134ddf996
fc25d15e3a
b646e5df83
e076254b26
fc4852b397
9ae029fc23
e04f3bce4f
b734bf0646
55443f2b16
bad0017d5b
91063b0c3e

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: a20549d3cdb6d1fcf4dacbaa31ba59e778d3d462
2020-04-23 21:34:46 -07:00
ff21b15624 cmake: add USE_SYSTEM_{LIBS,CPUINFO,SLEEF} options (#14699) (#37137)
Summary:
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37137

Differential Revision: D21222632

Pulled By: ezyang

fbshipit-source-id: 47624b30f8d07b31a40a26edf665bbec39e45202
2020-04-23 20:43:36 -07:00
05e98149ae Refactor lambda post hook. (#37025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37025

This allows us to reuse this framework in other places.

Test Plan:
buck test mode/dev-nosan
caffe2/torch/fb/distributed/model_parallel/tests:test_dist_optim --
test_optimizer_hook

Differential Revision: D20958327

fbshipit-source-id: 2a37dae3687fea8820427e174900111b58673194
2020-04-23 15:29:34 -07:00
35f7945828 Revert D21196366: [pytorch][PR] Update cusparse deprecated Xcsrmm2 call
Test Plan: revert-hammer

Differential Revision:
D21196366

Original commit changeset: 592d6bd6379f

fbshipit-source-id: 3bb8c090c13a1f4af0edea5b99d81fee53240ef9
2020-04-23 15:19:36 -07:00
fd5b5cd604 Allowing casting str to int in JIT (#36016)
Summary:
Changelog:
- Allow int(str) in TorchScript
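A minimal sketch of the newly allowed pattern (the function name is illustrative):
```python
import torch

@torch.jit.script
def parse(s: str) -> int:
    # int(str) now compiles in TorchScript instead of being rejected
    return int(s)

print(parse("42"))  # 42
```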
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36016

Test Plan:
- Added tests in test_jit.py

Closes https://github.com/pytorch/pytorch/issues/35948

Differential Revision: D21076438

Pulled By: driazati

fbshipit-source-id: d0753dc0e1c79f4f943c303b58b2d228856ba793
2020-04-23 14:26:24 -07:00
989341c0c6 Add comments to explain how MultiProcessTestCase works (#37179)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37179

Test Plan: Imported from OSS

Differential Revision: D21211196

Pulled By: mrshenli

fbshipit-source-id: 86dc8183b4def7e95a236f6c5c73ef67466f4ddb
2020-04-23 14:17:12 -07:00
c4b9f3bf55 Enable torch_speed_benchmark to accept different memory formats. (#36202)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36202

Test Plan: Imported from OSS

Differential Revision: D20970216

Pulled By: AshkanAliabadi

fbshipit-source-id: bb5a260e5677716356eec6ad4daa1f3c65420bbd
2020-04-23 13:18:43 -07:00
ba3f8d35e0 Enable stateless XNNPACK linear. (#35791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35791

The optimal way to use XNNPACK is to separate operator creation
from execution, also called prepacking the weights.  If we have done
our job properly, the JIT will have caught and replaced nn.Linear on mobile
with the prepacked versions.  Even so, if we somehow end up in
at::native::linear for whatever reason, it is still more efficient to go
through XNNPACK than the alternatives of at::addmm or at::matmul.

Differential Revision: D20821863

Test Plan: Imported from OSS

Pulled By: AshkanAliabadi

fbshipit-source-id: 5a75bfd900435c89c1b8536dc09248e788292e0c
2020-04-23 13:18:38 -07:00
72f80b5247 Enable stateless XNNPACK convolutions. (#35790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35790

The optimal way to use XNNPACK is to separate operator creation
from execution, also called prepacking the weights.  If we have done
our job properly, the JIT will have caught and replaced nn.Conv2ds on mobile
with the prepacked versions.  Even so, if we somehow end up in
_convolution for whatever reason, it is still more efficient to go
through XNNPACK for NHWC tensors, compared to the alternative of
converting NHWC to NCHW and going through NNPACK.

Differential Revision: D20821864

Test Plan: Imported from OSS

Pulled By: AshkanAliabadi

fbshipit-source-id: 2732280c2fd31edcb39658f6530d03331a1a4a75
2020-04-23 13:17:10 -07:00
e98cdfa26f Migrate tanh from TH to ATen (CUDA) (#36995)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24642

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tanh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tanh(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.2816318240002147
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.2728829070001666
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.39797203200214426
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.3228214350019698
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.31780802399953245
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.3745740449994628
```

After:

```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.27825374500025646
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.27764024499992956
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.3771585260001302
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.2995866400015075
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.28355561699936516
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.393811182002537
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36995

Differential Revision: D21163353

Pulled By: ngimel

fbshipit-source-id: e2216ff62cdfdd13b6a56daa63d4ef1440d991d4
2020-04-23 12:29:27 -07:00
7aec364bdf extend gather shape check to handle incorrectly sized outputs (#37102)
Summary:
Fixes a safety issue (nonsense values and segfaults) introduced by https://github.com/pytorch/pytorch/pull/36875 when in-place gather is given incorrectly shaped outputs.

Consider the following block of code:
```python
k0 = 8
k1 = 8
m = 100

x = torch.rand((k0, k1))
ind = torch.randint(0, k0, (m, k1))
output = torch.empty((m, k1))

print(torch.gather(x, 0, ind, out=output))
print(torch.gather(x, 1, ind, out=output))
```

The first gather is legal, the second is not (`ind` and `output` would need to be transposed). Previously this was caught when the kernel tried to restride the inputs for TensorIterator, but we can no longer rely on those checks and must test explicitly. If `m` is small, the second gather returns gibberish; if it is large enough to push the read out of the memory block, the program segfaults.
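A hedged Python sketch of the explicit check this PR adds (the real check lives in the C++ gather code; the helper below is illustrative): since gather's result always has the same shape as `index`, a user-supplied `out` must match `index` exactly.
```python
import torch

def check_gather_out(index: torch.Tensor, out: torch.Tensor) -> None:
    # gather's output shape is defined to equal index's shape, so any
    # mismatch means the out= tensor cannot possibly be correct.
    if out.shape != index.shape:
        raise RuntimeError(
            f"gather: out shape {tuple(out.shape)} must match "
            f"index shape {tuple(index.shape)}")
```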
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37102

Differential Revision: D21190580

Pulled By: robieta

fbshipit-source-id: 80175620d24ad3380d78995f7ec7dbf2627d2998
2020-04-23 11:47:01 -07:00
006f1a32f8 Mobile CPU allocator. (#36032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36032

QNNPACK and XNNPACK may access the input and/or output tensors out of bounds.
This is by design, chosen to make the implementation of micro-kernels
both simpler and faster as a result of not having to individually handle the
corner cases where the number of processed elements is not a multiple of SIMD
register width.  This behavior will trigger ASAN though, and may result in a
segfault if the accessed memory location just so happens to fall on a page
the current process has no read access to.  Here we define a custom allocator
that allocates the extra storage required to keep this behavior safe.  This
allocator could have been restricted to QNNPACK and XNNPACK only, but that
would have negative performance ramifications, as input tensors must now be
reallocated, and copied over, if the tensor is not allocated with this
allocator to begin with.  Making this allocator the default on mobile builds
minimizes the probability of unnecessary reallocations and copies, and
also enables acceleration of operations where the output tensor is allocated
outside of the function doing the implementation, wherein the implementation
cannot simply re-allocate the output with the guarding allocator.
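A hedged Python sketch of the guarding idea (the real allocator is C++, and the padding size here is assumed): every allocation gets extra tail bytes, so reads slightly past the logical end stay inside memory the process owns.
```python
import ctypes

GUARD_BYTES = 16  # assumed padding; the real allocator's value differs

def guarded_alloc(nbytes: int) -> ctypes.Array:
    # The extra tail bytes keep QNNPACK/XNNPACK-style out-of-bounds reads
    # inside this process's mapping, avoiding ASAN reports and segfaults.
    return ctypes.create_string_buffer(nbytes + GUARD_BYTES)
```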

Test Plan: Imported from OSS

Differential Revision: D20970217

Pulled By: AshkanAliabadi

fbshipit-source-id: 65cca2d38d7c0cef63c732f393016f50f1fa5199
2020-04-23 11:03:03 -07:00
ebfe631ed8 [TensorExpr] Cleanup TensorExprKernel class and add CPP tests for it. (#36952)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36952

Differential Revision: D21139939

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: a6605c0d6ccbb049ce27e6cdcc8fd8d2ebc057e3
2020-04-23 10:51:33 -07:00
c306f2ed08 Revert D20660338: [pytorch][PR] Migrate addmv and mv from legacy to ATen native (CUDA & CPU)
Test Plan: revert-hammer

Differential Revision:
D20660338

Original commit changeset: db1f521f1241

fbshipit-source-id: 8616ddd7bbd8f00351cfc45331a09b0bc9aa28ea
2020-04-23 10:46:45 -07:00
438aed63a1 Fix prelu_backward TensorIterator split (#36134)
Summary:
We should have
```C++
    for (auto& sub_iter : iter.with_32bit_indexing()) {
      launch_prelu_cuda_backward_share_weights_kernel(sub_iter, weight_data);
    }
```

But I mistakenly wrote it as

```C++
    for (auto& sub_iter : iter.with_32bit_indexing()) {
      launch_prelu_cuda_backward_share_weights_kernel(iter, weight_data);
    }
```

in my previous PR, which leads to infinite recursion.

I found this bug when working on https://github.com/pytorch/pytorch/pull/34004

I also add a `TORCH_INTERNAL_ASSERT_DEBUG_ONLY` to test for this.

Besides, the caller is already guaranteed to pass contiguous tensors, so we don't need to handle non-contiguous ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36134

Differential Revision: D21187542

Pulled By: VitalyFedyunin

fbshipit-source-id: 0fafdd7b672bf89fcaa2b42e08b7d41ade7e6bcb
2020-04-23 10:42:20 -07:00
230b68168b [quant] Refactor test files (#36964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36964

Rename and restructure quantization related tests
https://github.com/pytorch/pytorch/issues/31625

Test Plan:
.

Imported from OSS

Differential Revision: D21192509

fbshipit-source-id: 148c93e86e0ea68ab18a067fe74a8035a29a1e4e
2020-04-23 10:28:56 -07:00
ab2a9ab925 Non-blocking SyncBatchNorm update (#36659)
Summary:
As shown in https://github.com/pytorch/pytorch/issues/36452, SyncBatchNorm can block the host thread due to the ``MemcpyDtoH`` and ``MemcpyHtoD`` calls issued when dealing with the ``counts`` argument of the native function ``batch_norm_gather_stats_with_counts``.

- This fix changes the signature of ``batch_norm_gather_stats_with_counts`` to
```c++
std::tuple<Tensor, Tensor> batch_norm_gather_stats_with_counts_cuda(const Tensor& self, const Tensor& mean, const Tensor& invstd, const Tensor& running_mean, const Tensor& running_var, double momentum, double epsilon, const Tensor& counts)
```
so it can directly receive ``counts`` in a CUDA tensor rather than an ``IntArrayRef`` whose data is in host memory.

- This fix also improves the implementation of the ``SyncBatchNorm`` function so that constructing the ``counts`` tensor does not cause an additional ``MemcpyHtoD``, which would also block the host thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36659

Differential Revision: D21196991

Pulled By: ngimel

fbshipit-source-id: 84a529e6cf22e03618fecbb8f070ec452f81229e
2020-04-23 10:22:19 -07:00
f11df2d2b4 Use temporary variable to store input parameters in loop. (#36288)
Summary:
The original implementations of the maxpool and im2col kernels fail if `gridSize` * `blockSize` is smaller than `nthreads` in the maxpool kernel or `n` in the im2col kernel. The input parameters `bottom_data`, `data_col`, and `data_im`, and the loop index `index`, are modified inside the loop body, so the corrupted values are carried into the second grid-stride iteration.

This patch uses temporary variables to replace the input parameters and loop indices.
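A hedged Python model of the failure (the real kernels are CUDA): with fewer threads than elements, each thread grid-strides over several indices, and the buggy kernels advanced the input-pointer parameters inside the loop body, so the second pass read from a corrupted base. Deriving a fresh temporary each pass, as the patch does, keeps every iteration correct.
```python
def copy_kernel(data_im, data_col, tid, num_threads):
    # Grid-stride loop: this "thread" handles tid, tid + num_threads, ...
    for index in range(tid, len(data_im), num_threads):
        src = data_im[index:]     # temporary recomputed every pass,
        data_col[index] = src[0]  # instead of advancing data_im itself

n, num_threads = 10, 4  # num_threads < n forces multiple passes per thread
data_im, data_col = list(range(n)), [0] * n
for tid in range(num_threads):
    copy_kernel(data_im, data_col, tid, num_threads)
assert data_col == data_im
```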
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36288

Differential Revision: D21189020

Pulled By: VitalyFedyunin

fbshipit-source-id: a8075a35e707e6cc99cffd0b2177369e8caea37c
2020-04-23 10:16:58 -07:00
3799d1d74a Fix many doc issues (#37099)
Summary:
Fix https://github.com/pytorch/pytorch/issues/35643 https://github.com/pytorch/pytorch/issues/37063 https://github.com/pytorch/pytorch/issues/36307 https://github.com/pytorch/pytorch/issues/35861 https://github.com/pytorch/pytorch/issues/35299 https://github.com/pytorch/pytorch/issues/23108 https://github.com/pytorch/pytorch/issues/4661

Just a bunch of small updates on the doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37099

Differential Revision: D21185713

Pulled By: albanD

fbshipit-source-id: 4ac06d6709dc0da6109a6ad3daae75667ee5863e
2020-04-23 10:01:03 -07:00
9763db3031 MultiProcessTestCase to use instance rather than class method wrappers (#36826)
Summary:
This makes its wrappers stackable with `common_utils.TestCase` ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36826

Test Plan: CI

Differential Revision: D21178217

Pulled By: mrshenli

fbshipit-source-id: f80dd4aa175e20bd338b38b2c42c3118258f45dc
2020-04-23 08:40:02 -07:00
3880f14b64 Canonicalize includes in torch, and add tests for it (#36303)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36303

Test Plan: Imported from OSS

Differential Revision: D20943003

Pulled By: ezyang

fbshipit-source-id: 81fcbaccc1a7eec422bd8347d196bb66a5467884
2020-04-23 08:09:21 -07:00
b3f04a398a Re-enable JIT test test_class_sorting (#37140)
Summary:
Closes gh-36902
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37140

Differential Revision: D21202738

Pulled By: ezyang

fbshipit-source-id: ff05384b8c0f625ed1aaaa82628af13d5496788b
2020-04-23 07:55:39 -07:00
11cef0fe88 Update cusparse deprecated Xcsrmm2 call (#36845)
Summary:
The new function signature is documented at https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm.

Please also check https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-api-reference for the limitations. I have added a Windows guard in this PR.

> LIMITATION: The generic APIs are currently available for all platforms except Windows. Using these APIs in any other systems will result in compile-time or run-time failures. Their support will be extended in the next releases.

Edit: also add a CUDA guard to let ROCm use the old-version API (avoiding build failures).

Since the new cusparse signatures sometimes give inaccurate results in CUDA 10.1, and this was fixed in CUDA 10.2, the new signatures should only be used with CUDA >= 10.2
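A hedged sketch of that gating in Python terms (the real guard is a C++ preprocessor check): take the new generic-API path only on CUDA >= 10.2 and off Windows.
```python
import sys
import torch

def use_generic_spmm() -> bool:
    cuda = torch.version.cuda  # e.g. "10.2"; None on CPU-only/ROCm builds
    if cuda is None or sys.platform == "win32":
        return False  # stay on the legacy csrmm2-style path
    major, minor = (int(p) for p in cuda.split(".")[:2])
    return (major, minor) >= (10, 2)
```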

cc csarofeen ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36845

Differential Revision: D21196366

Pulled By: ezyang

fbshipit-source-id: 592d6bd6379f7db52cbad827d43864ea65ff18ea
2020-04-23 07:05:44 -07:00
a38c6e0454 Migrate addmv and mv from legacy to ATen native (CUDA & CPU) (#30898)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/24605 https://github.com/pytorch/pytorch/issues/24535 https://github.com/pytorch/pytorch/issues/24739 https://github.com/pytorch/pytorch/issues/24680 https://github.com/pytorch/pytorch/issues/30986

This does not fix https://github.com/pytorch/pytorch/issues/29984; it will be fixed in a later PR.

Most of this PR just follows the same logic as the TH and THC code, except for the handling of n-dimensional zero-sized tensors, specifically this case:
```
(m,).addmv((m, 0), (0,), beta, alpha)
```

# Legacy code bugs and how this PR deals with them

The above is a case where BLAS semantics often mismatch PyTorch's: for BLAS and cuBLAS, the above is a no-op, but for PyTorch it is a scalar-vector multiplication `output = beta * input`. The handling of this case in legacy code is already very poor, and it is poorly tested:

For the CPU implementation, there are two code paths:
- Path 1: when dtype is float or double and `USE_BLAS`, then use BLAS
- Path 2: when other dtypes or not `USE_BLAS`, use a fallback kernel in PyTorch

For the CUDA implementation, there are also two code paths:
- Path 1: when float or double, then use `cublasSgemv` or `cublasDgemv` in cuBlas
- Path 2: when half, dispatch to `addmm`

`test_blas_alpha_beta_empty` is supposed to cover all cases, but unfortunately it only tests path 1 on CPU and path 1 on CUDA, and both uncovered paths (path 2 for CPU and path 2 for CUDA) are buggy in legacy code. In this PR, I expanded the coverage of `test_blas_alpha_beta_empty`, but unfortunately I have to skip the `half` dtype on CUDA 9. See the descriptions below for details:

## Bug in the CPU implementation

For the CPU implementation, the fallback kernel in path 2 already has PyTorch's semantics, not BLAS's. But the code that corrects BLAS semantics to match PyTorch's also runs in this case, leading to a double correction: `output = beta * input` becomes `output = beta * beta * input`.

This leads to the issue https://github.com/pytorch/pytorch/issues/30986 I just opened, and it is fixed in this PR.

## Bug in the CUDA implementation

For the CUDA implementation, path 2 dispatches to
```
(m, 1).addmm((m, 0), (0, 1), beta, alpha)
```
But unfortunately, on some old CUDA versions with old GPUs and the half dtype, the above is also a no-op, which is definitely not correct.

From what I can see, this is not a problem on newer CUDA versions or newer GPUs. This is a PyTorch bug in `addmm`, so I opened a new issue https://github.com/pytorch/pytorch/issues/31006 to track it. It is most likely a dependency bug originating from cuBLAS, and it only affects a rarely used edge case on old hardware and software, so the issue would be a `won't_fix` unless some real requirements strongly indicate that it should be fixed.

This issue already exists in legacy code, and this PR does not make it worse. To keep it from bothering us, I disable the `half` dtype test for CUDA 9 when expanding the coverage of `test_blas_alpha_beta_empty`.

I promote a CircleCI CUDA 10.1 test to `XImportant` so that it runs on PRs, because path 2 of the CUDA implementation is only covered by this configuration. Let me know if I should revert this change.

## An additional problem

In the legacy `addmv` code, the `bfloat16` dtype is enabled and dispatches to `addmm`, but from what I tested, `addmm` does not support `bfloat16`. I do the same in the new code; let me know if I should do it differently.

# Benchmark

Code:
```python
import torch
print(torch.__version__)

for i in range(1000):
    torch.arange(i, device='cuda')

print('cpu')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,))
    b = torch.randn((i, i))
    c = torch.randn((i,))
    %timeit a.addmv(b, c, alpha=1, beta=2)

print('cuda')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,)).cuda()
    b = torch.randn((i, i)).cuda()
    c = torch.randn((i,)).cuda()
    torch.cuda.synchronize()
    %timeit a.addmv(b, c, alpha=1, beta=2); torch.cuda.synchronize()
```

Before:
```
1.5.0a0+2b45368
cpu
2.74 µs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.5 µs ± 85.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
686 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
74 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
The slowest run took 4.81 times longer than the fastest. This could mean that an intermediate result is being cached.
27.6 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.3 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
20.5 µs ± 369 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
756 µs ± 6.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

After:
```
1.5.0a0+66b4034
cpu
3.29 µs ± 20 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.09 µs ± 7.41 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
687 µs ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
73.8 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
18.2 µs ± 478 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
17.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.5 µs ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
751 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30898

Differential Revision: D20660338

Pulled By: anjali411

fbshipit-source-id: db1f521f124198f63545064026f93fcb16b68f18
2020-04-23 06:56:49 -07:00
0dd21c3b72 Lets @dtypes take tuples of dtypes (#36908)
Summary:
Lets @dtypes take tuples of dtypes instead of just single dtypes. This pattern comes up when tests have distinct input and output dtypes. A test in test_type_promotion is updated to use the new behavior.
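A hedged sketch of the new usage (the test body is illustrative, and exactly how the tuple reaches the test, including the parameter name, is assumed here):
```python
import torch
from torch.testing._internal.common_device_type import dtypes

@dtypes((torch.float32, torch.int64), (torch.float64, torch.int64))
def test_in_out_pair(self, device, dtype):
    in_dtype, out_dtype = dtype  # each generated test receives one tuple
```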
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36908

Differential Revision: D21161523

Pulled By: mruberry

fbshipit-source-id: ebac81c1b6c494a2146d595fcdb3e35c22cf859c
2020-04-23 02:28:20 -07:00
50a1850d8d [pytorch] Route default warning sync to LOG(WARNING) - second try (#36984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36984

Follow the LOG(WARNING) format for C++-side warnings in order to play well with larger services, especially when using glog. I need to hook into glog internals a bit in order to override FILE/LINE without converting the whole thing to macros, but those internals seem to be stable across glog versions.

Note that this also changes caffe2_log_level to warning by default; I think that's a much better default when compiling without glog (maybe even info would be reasonable).

With glog output, stderr capture doesn't work anymore in tests. That's why we use c10-level warning capture instead.

Test Plan:
Run unittest in both glog and non-glog build mode:

glog:
```
W0416 12:06:49.778215 3311666 exception_test.cpp:23] Warning: I'm a warning (function TestBody)
```

no-glog:
```
[W exception_test.cpp:23] Warning: I'm a warning (function TestBody)
```

Reviewed By: ilia-cher

Differential Revision: D21151351

fbshipit-source-id: fa926d9e480db5ff696990dad3d80f79ef79f24a
2020-04-23 01:08:00 -07:00
b889e0da8a [torch] Excluding test_fft_input_modification without MKL (#36680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36680

If torch is compiled without MKL, this test fails because torch.fft requires MKL support.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D21051362

fbshipit-source-id: dd2e2c7d323622c1c25fc4c817b85d83d2241b3a
2020-04-22 21:58:02 -07:00
355cafde26 [ROCm] Don't use MIOpen for tensors with more than INT_MAX number of elements (#37110)
Summary:
This pull request extends the fallback implemented in https://github.com/pytorch/pytorch/issues/31383 to not use MIOpen for tensors whose number of elements exceeds INT_MAX. The PR also enables the corresponding test in TestNN.

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37110

Differential Revision: D21196336

Pulled By: ezyang

fbshipit-source-id: 25fd80308a0e2f7941c249735674ebc85d3fd39e
2020-04-22 21:20:53 -07:00
f46231a2f4 Revert D21144940: [pytorch][PR] ci: Change file_diff_from_base to be dynamic
Test Plan: revert-hammer

Differential Revision:
D21144940

Original commit changeset: ec6d1c2adcf7

fbshipit-source-id: e4f2d37b7e043aadc210ef45b2bba6cf859aeed3
2020-04-22 21:09:09 -07:00
f771c96852 Returns float from complex angle (#36896)
Summary:
Updates angle to return a float tensor, by default, when given complex inputs. This behavior is compatible with Python, NumPy, and C++. The implementation follows the former implementation for complex abs, extracting the logic into a common function for both abs and angle.
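A minimal illustration of the new default (run against a build that includes this change):
```python
import torch

z = torch.tensor([1 + 1j])   # complex input
print(torch.angle(z))        # tensor([0.7854]), i.e. pi/4
print(torch.angle(z).dtype)  # a float dtype, no longer complex
```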

The test for complex abs's behavior in test_type_promotion.py is updated to also test the behavior of complex angle by comparing its results to NumPy's.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36896

Differential Revision: D21170589

Pulled By: mruberry

fbshipit-source-id: f5a634aea351dd58a8376f1474fc5a6422038cbf
2020-04-22 19:51:15 -07:00
45706bf6d8 properly whitelist clang-format in CI (#37122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37122

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D21194505

Pulled By: suo

fbshipit-source-id: d756c5291535ac1aefbbad57b38f324e42e2c2f7
2020-04-22 19:24:22 -07:00
7c7cb74887 Add missing ${CMAKE_CURRENT_SOURCE_DIR}/complex_test.cpp (#37080)
Summary:
This test is never built in OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37080

Differential Revision: D21179296

Pulled By: anjali411

fbshipit-source-id: 22a5b82f17676213c8ec51642bef35dc61f9cace
2020-04-22 19:22:59 -07:00
ca665c682c Separate RTLD_GLOBAL from _load_global_deps() (#36682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36682

For FB-internal builds we need to separate whether to use the global deps library from whether to load it with RTLD_GLOBAL.

Test Plan: CI -- this should be a no-op for existing builds

Reviewed By: ezyang

Differential Revision: D21051427

fbshipit-source-id: 83bb703d6ceb0265a4c58166749312a44172e78c
2020-04-22 19:08:44 -07:00
d0291df7d9 [resubmit] Rebase xla job on top master before running CI build. (#37085)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/36852
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37085

Differential Revision: D21180010

Pulled By: ailzhang

fbshipit-source-id: c448b836e4c13b15860e0b69da76082bd644badd
2020-04-22 18:46:21 -07:00
e557b7cec2 Kill BC hack in torchbind (#37112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37112

Test Plan: Imported from OSS

Differential Revision: D21189654

Pulled By: jamesr66a

fbshipit-source-id: 83469d4f81cdecc2897594e80e7b84047239e10a
2020-04-22 17:36:18 -07:00
4ab46f6baf [pytorch] Delete unneeded scripts
Summary: These aren't needed

Test Plan: Look closely

Differential Revision: D21191456

fbshipit-source-id: d9921afb5363106406a0f6432612586ff4be4290
2020-04-22 17:23:52 -07:00
de090c42b1 Optimize binary size of assert macros (#37023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37023

Optimize the binary size of the assert macros through two ideas:

1. Concatenate string literals with __FILE__ and __LINE__ at compile time into one literal, instead of keeping them as separate literals and combining them with c10::str.
2. Optimize the binary size of c10::str for some scenarios, especially the one where it is called with an empty parameter list; this is actually a common call pattern in the assert macros.

In server OSS builds, this PR reduces binary size from 118.05 MB to 117.05 MB.
ghstack-source-id: 102607237

Test Plan: Run oss server build (python setup.py install) and check size of libtorch_cpu.so reducing from 118.05MB to 117.05MB

Differential Revision: D20719400

fbshipit-source-id: 5c61f4195b947f06aafb8f0c8e255de3366e1ff2
2020-04-22 17:13:17 -07:00
7f50162d1e quantized activations: clean up more unneeded quantizations (#36981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36981

Replaces unneeded quantize calls for remaining quantized
activations with empty tensor creation.
Should be a perf win for anyone who uses these.
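A hedged Python sketch of the pattern (the kernels themselves are C++; the scale and zero point below are placeholders): allocate the quantized output directly instead of quantizing a freshly computed float tensor.
```python
import torch

scale, zero_point = 0.1, 0  # placeholder quantization params
x = torch.randn(4)

# Old pattern: compute in float, then quantize (an extra pass over the data).
out_slow = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

# New pattern: create the quantized output directly for the kernel to fill.
out_fast = torch._empty_affine_quantized(
    x.size(), scale=scale, zero_point=zero_point, dtype=torch.quint8)
```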

Test Plan:
python test/quantization/test_quantized.py TestQuantizedOps

Imported from OSS

Differential Revision: D21185969

fbshipit-source-id: 473b2b8aa40046ea3f0665bd45b03f09e8a7d572
2020-04-22 16:17:08 -07:00
2773ed3082 hardswish: remove unnecessary quantize call (#36980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980

Missed this on the original diff, fixing.  Create the output tensor directly instead of quantizing it.

Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)

Imported from OSS

Differential Revision: D21185970

fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
2020-04-22 16:15:54 -07:00
6fcabf619d [takeover] BTRS algorithm for fast/efficient binomial sampling (#36858)
Summary:
The original PR is https://github.com/pytorch/pytorch/pull/31278.

CC: ezyang jamestwebber fritzo zasdfgbnm

 ---

```
# This PR - GPU
In [1]: import torch; import torch.distributions as dist

In [2]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()

In [3]:  %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
737 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# master (commit: 806f22b167c74897cf67c0828b528fa3e4e6d6de) - GPU
In [5]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()

In [6]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
46.3 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36858

Differential Revision: D21178367

Pulled By: ezyang

fbshipit-source-id: 7e7d6f463e35b07156d69bd7452040b2f9c2eb7a
2020-04-22 15:53:41 -07:00
baaa0943f1 Update third_party/cpuinfo to include a fix for conda builds, older kernels (#37083)
Summary:
Includes a single bug fix, https://github.com/pytorch/cpuinfo/pull/37, for a bug that showed up as a build error for undefined `__NR_getcpu`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37083

Differential Revision: D21187439

Pulled By: ezyang

fbshipit-source-id: f34be3937b2cb6c6d1c40f86098ef59f17781d66
2020-04-22 15:23:48 -07:00
8d6a8d2b3f Fix DDP bug in single process multiple device use cases (#36503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36503

Test Plan: Imported from OSS

Differential Revision: D21179274

Pulled By: mrshenli

fbshipit-source-id: 0afce30ae0ddda753d1e240584a0f80df9aec4c2
2020-04-22 15:06:28 -07:00
efcbcca454 Revert D21138687: [pytorch][PR] Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex
Test Plan: revert-hammer

Differential Revision:
D21138687

Original commit changeset: ad3602ccf86c

fbshipit-source-id: 69eb031c1a7c3d5e4b9f4241fbdada8d5980535d
2020-04-22 14:49:45 -07:00
78d5707041 Fix type annotations and make MyPy run on torch/ (#36584)
Summary:
This PR fixes a couple of syntax errors in `torch/` that prevent MyPy from running, fixes simple type annotation errors (e.g. missing `from typing import List, Tuple, Optional`), and adds granular ignores for errors in particular modules as well as for missing typing in third party packages.

As a result, running `mypy` in the root dir of the repo now runs on:
- `torch/`
- `aten/src/ATen/function_wrapper.py` (the only file already covered in CI)

In CI this runs on GitHub Actions, job Lint, sub-job "quick-checks", task "MyPy typecheck". It should give (right now): `Success: no issues found in 329 source files`.

Here are the details of the original 855 errors when running `mypy torch` on current master (after fixing the couple of syntax errors that prevent `mypy` from running through):

<details>

```
torch/utils/tensorboard/_proto_graph.py:1: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.node_def_pb2'
torch/utils/tensorboard/_proto_graph.py:2: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.attr_value_pb2'
torch/utils/tensorboard/_proto_graph.py:3: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.tensor_shape_pb2'
torch/utils/backcompat/__init__.py:1: error: Cannot find implementation or library stub for module named 'torch._C'
torch/for_onnx/__init__.py:1: error: Cannot find implementation or library stub for module named 'torch.for_onnx.onnx'
torch/cuda/nvtx.py:2: error: Cannot find implementation or library stub for module named 'torch._C'
torch/utils/show_pickle.py:59: error: Name 'pickle._Unpickler' is not defined
torch/utils/show_pickle.py:113: error: "Type[PrettyPrinter]" has no attribute "_dispatch"
torch/utils/tensorboard/_onnx_graph.py:1: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.graph_pb2'
torch/utils/tensorboard/_onnx_graph.py:2: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.node_def_pb2'
torch/utils/tensorboard/_onnx_graph.py:3: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.versions_pb2'
torch/utils/tensorboard/_onnx_graph.py:4: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.attr_value_pb2'
torch/utils/tensorboard/_onnx_graph.py:5: error: Cannot find implementation or library stub for module named 'tensorboard.compat.proto.tensor_shape_pb2'
torch/utils/tensorboard/_onnx_graph.py:9: error: Cannot find implementation or library stub for module named 'onnx'
torch/contrib/_tensorboard_vis.py:10: error: Cannot find implementation or library stub for module named 'tensorflow.core.util'
torch/contrib/_tensorboard_vis.py:11: error: Cannot find implementation or library stub for module named 'tensorflow.core.framework'
torch/contrib/_tensorboard_vis.py:12: error: Cannot find implementation or library stub for module named 'tensorflow.python.summary.writer.writer'
torch/utils/hipify/hipify_python.py:43: error: Need type annotation for 'CAFFE2_TEMPLATE_MAP' (hint: "CAFFE2_TEMPLATE_MAP: Dict[<type>, <type>] = ...")
torch/utils/hipify/hipify_python.py:636: error: "object" has no attribute "items"
torch/nn/_reduction.py:27: error: Name 'Optional' is not defined
torch/nn/_reduction.py:27: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/nn/_reduction.py:47: error: Name 'Optional' is not defined
torch/nn/_reduction.py:47: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/utils/tensorboard/_utils.py:17: error: Skipping analyzing 'matplotlib.pyplot': found module but no type hints or library stubs
torch/utils/tensorboard/_utils.py:17: error: Skipping analyzing 'matplotlib': found module but no type hints or library stubs
torch/utils/tensorboard/_utils.py:18: error: Skipping analyzing 'matplotlib.backends.backend_agg': found module but no type hints or library stubs
torch/utils/tensorboard/_utils.py:18: error: Skipping analyzing 'matplotlib.backends': found module but no type hints or library stubs
torch/nn/modules/utils.py:27: error: Name 'List' is not defined
torch/nn/modules/utils.py:27: note: Did you forget to import it from "typing"? (Suggestion: "from typing import List")
caffe2/proto/caffe2_pb2.py:17: error: Unexpected keyword argument "serialized_options" for "FileDescriptor"; did you mean "serialized_pb"?
caffe2/proto/caffe2_pb2.py:25: error: Unexpected keyword argument "serialized_options" for "EnumDescriptor"
caffe2/proto/caffe2_pb2.py:31: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:35: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:39: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:43: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:47: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:51: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:55: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:59: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:63: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:67: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:71: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:75: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:102: error: Unexpected keyword argument "serialized_options" for "EnumDescriptor"
caffe2/proto/caffe2_pb2.py:108: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:112: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:124: error: Unexpected keyword argument "serialized_options" for "EnumDescriptor"
caffe2/proto/caffe2_pb2.py:130: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:134: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:138: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:142: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:146: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:150: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:154: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:158: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:162: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:166: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:170: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:174: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:178: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:182: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:194: error: Unexpected keyword argument "serialized_options" for "EnumDescriptor"
caffe2/proto/caffe2_pb2.py:200: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:204: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:208: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:212: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:224: error: Unexpected keyword argument "serialized_options" for "EnumDescriptor"
caffe2/proto/caffe2_pb2.py:230: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:234: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:238: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:242: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:246: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:250: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:254: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/caffe2_pb2.py:267: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:274: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:281: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:288: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:295: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:302: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:327: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:334: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:341: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:364: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:371: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:378: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:385: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:392: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:399: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:406: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:413: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:420: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:427: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:434: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:441: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:448: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:455: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:462: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:488: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:495: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:502: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:509: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:516: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:523: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:530: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:537: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:544: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:551: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:558: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:565: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:572: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:596: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:603: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:627: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:634: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:641: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:648: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:655: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:662: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:686: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:693: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:717: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:724: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:731: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:738: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:763: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:770: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:777: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:784: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:808: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:815: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:822: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:829: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:836: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:843: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:850: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:857: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:864: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:871: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:878: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:885: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:892: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:916: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:923: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:930: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:937: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:944: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:951: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:958: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:982: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:989: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:996: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1003: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1010: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1017: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1024: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1031: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1038: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1045: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1052: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1059: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1066: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1090: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1097: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1104: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1128: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1135: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1142: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1166: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1173: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1180: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1187: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1194: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1218: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1225: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1232: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1239: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1246: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1253: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1260: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1267: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1274: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1281: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1305: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1312: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1319: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1326: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1333: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1340: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1347: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1354: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1361: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1368: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1375: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1382: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1389: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1396: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1420: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1427: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1434: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1441: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1465: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1472: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1479: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1486: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1493: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1500: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1507: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1514: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1538: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/caffe2_pb2.py:1545: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1552: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1559: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1566: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/caffe2_pb2.py:1667: error: "GeneratedProtocolMessageType" has no attribute "Segment"
torch/multiprocessing/queue.py:4: error: No library stub file for standard library module 'multiprocessing.reduction'
caffe2/proto/torch_pb2.py:18: error: Unexpected keyword argument "serialized_options" for "FileDescriptor"; did you mean "serialized_pb"?
caffe2/proto/torch_pb2.py:27: error: Unexpected keyword argument "serialized_options" for "EnumDescriptor"
caffe2/proto/torch_pb2.py:33: error: Unexpected keyword argument "serialized_options" for "EnumValueDescriptor"
caffe2/proto/torch_pb2.py:50: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/torch_pb2.py:57: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:81: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/torch_pb2.py:88: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:95: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:102: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:109: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:116: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:123: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:130: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:137: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:144: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:151: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:175: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/torch_pb2.py:182: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:189: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:196: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:220: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/torch_pb2.py:227: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:234: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:241: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:265: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/torch_pb2.py:272: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:279: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:286: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:293: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:300: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:307: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:314: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:321: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:328: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:335: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:342: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:366: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/torch_pb2.py:373: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:397: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/torch_pb2.py:404: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:411: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:418: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:425: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/torch_pb2.py:432: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:17: error: Unexpected keyword argument "serialized_options" for "FileDescriptor"; did you mean "serialized_pb"?
caffe2/proto/metanet_pb2.py:29: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/metanet_pb2.py:36: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:43: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:50: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:57: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:64: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:88: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/metanet_pb2.py:95: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:102: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:126: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/metanet_pb2.py:133: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:140: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:164: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/metanet_pb2.py:171: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:178: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:202: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/metanet_pb2.py:209: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:216: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:240: error: Unexpected keyword argument "serialized_options" for "Descriptor"
caffe2/proto/metanet_pb2.py:247: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:254: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:261: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:268: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:275: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:282: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:289: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/metanet_pb2.py:296: error: Unexpected keyword argument "serialized_options" for "FieldDescriptor"
caffe2/proto/__init__.py:13: error: Skipping analyzing 'caffe2.caffe2.fb.session.proto': found module but no type hints or library stubs
torch/multiprocessing/pool.py:3: error: No library stub file for standard library module 'multiprocessing.util'
torch/multiprocessing/pool.py:3: note: (Stub files are from https://github.com/python/typeshed)
caffe2/python/scope.py:10: error: Skipping analyzing 'past.builtins': found module but no type hints or library stubs
caffe2/python/__init__.py:7: error: Module has no attribute "CPU"
caffe2/python/__init__.py:8: error: Module has no attribute "CUDA"
caffe2/python/__init__.py:9: error: Module has no attribute "MKLDNN"
caffe2/python/__init__.py:10: error: Module has no attribute "OPENGL"
caffe2/python/__init__.py:11: error: Module has no attribute "OPENCL"
caffe2/python/__init__.py:12: error: Module has no attribute "IDEEP"
caffe2/python/__init__.py:13: error: Module has no attribute "HIP"
caffe2/python/__init__.py:14: error: Module has no attribute "COMPILE_TIME_MAX_DEVICE_TYPES"; maybe "PROTO_COMPILE_TIME_MAX_DEVICE_TYPES"?
caffe2/python/__init__.py:15: error: Module has no attribute "ONLY_FOR_TEST"; maybe "PROTO_ONLY_FOR_TEST"?
caffe2/python/__init__.py:34: error: Item "_Loader" of "Optional[_Loader]" has no attribute "exec_module"
caffe2/python/__init__.py:34: error: Item "None" of "Optional[_Loader]" has no attribute "exec_module"
caffe2/python/__init__.py:35: error: Module has no attribute "cuda"
caffe2/python/__init__.py:37: error: Module has no attribute "cuda"
caffe2/python/__init__.py:49: error: Module has no attribute "add_dll_directory"
torch/random.py:4: error: Cannot find implementation or library stub for module named 'torch._C'
torch/_classes.py:2: error: Cannot find implementation or library stub for module named 'torch._C'
torch/onnx/__init__.py:1: error: Cannot find implementation or library stub for module named 'torch._C'
torch/hub.py:21: error: Skipping analyzing 'tqdm.auto': found module but no type hints or library stubs
torch/hub.py:24: error: Skipping analyzing 'tqdm': found module but no type hints or library stubs
torch/hub.py:27: error: Name 'tqdm' already defined (possibly by an import)
torch/_tensor_str.py:164: error: Not all arguments converted during string formatting
torch/_ops.py:1: error: Cannot find implementation or library stub for module named 'torch._C'
torch/_linalg_utils.py:26: error: Name 'Optional' is not defined
torch/_linalg_utils.py:26: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_linalg_utils.py:26: error: Name 'Tensor' is not defined
torch/_linalg_utils.py:63: error: Name 'Tensor' is not defined
torch/_linalg_utils.py:63: error: Name 'Optional' is not defined
torch/_linalg_utils.py:63: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_linalg_utils.py:70: error: Name 'Optional' is not defined
torch/_linalg_utils.py:70: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_linalg_utils.py:70: error: Name 'Tensor' is not defined
torch/_linalg_utils.py:88: error: Name 'Tensor' is not defined
torch/_linalg_utils.py:88: error: Name 'Optional' is not defined
torch/_linalg_utils.py:88: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_linalg_utils.py:88: error: Name 'Tuple' is not defined
torch/_linalg_utils.py:88: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/_jit_internal.py:17: error: Need type annotation for 'boolean_dispatched'
torch/_jit_internal.py:474: error: Need type annotation for '_overloaded_fns' (hint: "_overloaded_fns: Dict[<type>, <type>] = ...")
torch/_jit_internal.py:512: error: Need type annotation for '_overloaded_methods' (hint: "_overloaded_methods: Dict[<type>, <type>] = ...")
torch/_jit_internal.py:648: error: Incompatible types in assignment (expression has type "FinalCls", variable has type "_SpecialForm")
torch/sparse/__init__.py:11: error: Name 'Tensor' is not defined
torch/sparse/__init__.py:71: error: Name 'Tensor' is not defined
torch/sparse/__init__.py:71: error: Name 'Optional' is not defined
torch/sparse/__init__.py:71: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/sparse/__init__.py:71: error: Name 'Tuple' is not defined
torch/sparse/__init__.py:71: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/nn/init.py:109: error: Name 'Tensor' is not defined
torch/nn/init.py:126: error: Name 'Tensor' is not defined
torch/nn/init.py:142: error: Name 'Tensor' is not defined
torch/nn/init.py:165: error: Name 'Tensor' is not defined
torch/nn/init.py:180: error: Name 'Tensor' is not defined
torch/nn/init.py:194: error: Name 'Tensor' is not defined
torch/nn/init.py:287: error: Name 'Tensor' is not defined
torch/nn/init.py:315: error: Name 'Tensor' is not defined
torch/multiprocessing/reductions.py:8: error: No library stub file for standard library module 'multiprocessing.util'
torch/multiprocessing/reductions.py:9: error: No library stub file for standard library module 'multiprocessing.reduction'
torch/multiprocessing/reductions.py:17: error: No library stub file for standard library module 'multiprocessing.resource_sharer'
torch/jit/_builtins.py:72: error: Module has no attribute "_no_grad_embedding_renorm_"
torch/jit/_builtins.py:80: error: Module has no attribute "stft"
torch/jit/_builtins.py:81: error: Module has no attribute "cdist"
torch/jit/_builtins.py:82: error: Module has no attribute "norm"
torch/jit/_builtins.py:83: error: Module has no attribute "nuclear_norm"
torch/jit/_builtins.py:84: error: Module has no attribute "frobenius_norm"
torch/backends/cudnn/__init__.py:8: error: Cannot find implementation or library stub for module named 'torch._C'
torch/backends/cudnn/__init__.py:86: error: Need type annotation for '_handles' (hint: "_handles: Dict[<type>, <type>] = ...")
torch/autograd/profiler.py:13: error: Name 'ContextDecorator' already defined (possibly by an import)
torch/autograd/function.py:2: error: Cannot find implementation or library stub for module named 'torch._C'
torch/autograd/function.py:2: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
torch/autograd/function.py:109: error: Unsupported dynamic base class "with_metaclass"
torch/serialization.py:609: error: "Callable[[Any], Any]" has no attribute "cache"
torch/_lowrank.py:11: error: Name 'Tensor' is not defined
torch/_lowrank.py:13: error: Name 'Optional' is not defined
torch/_lowrank.py:13: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_lowrank.py:14: error: Name 'Optional' is not defined
torch/_lowrank.py:14: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_lowrank.py:14: error: Name 'Tensor' is not defined
torch/_lowrank.py:82: error: Name 'Tensor' is not defined
torch/_lowrank.py:82: error: Name 'Optional' is not defined
torch/_lowrank.py:82: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_lowrank.py:82: error: Name 'Tuple' is not defined
torch/_lowrank.py:82: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/_lowrank.py:130: error: Name 'Tensor' is not defined
torch/_lowrank.py:130: error: Name 'Optional' is not defined
torch/_lowrank.py:130: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_lowrank.py:130: error: Name 'Tuple' is not defined
torch/_lowrank.py:130: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/_lowrank.py:167: error: Name 'Tensor' is not defined
torch/_lowrank.py:167: error: Name 'Optional' is not defined
torch/_lowrank.py:167: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/_lowrank.py:167: error: Name 'Tuple' is not defined
torch/_lowrank.py:167: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/quantization/observer.py:45: error: Variable "torch.quantization.observer.ABC" is not valid as a type
torch/quantization/observer.py:45: note: See https://mypy.readthedocs.io/en/latest/common_issues.html#variables-vs-type-aliases
torch/quantization/observer.py:45: error: Invalid base class "ABC"
torch/quantization/observer.py:127: error: Name 'Tensor' is not defined
torch/quantization/observer.py:127: error: Name 'Tuple' is not defined
torch/quantization/observer.py:127: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/quantization/observer.py:172: error: Module has no attribute "per_tensor_symmetric"
torch/quantization/observer.py:172: error: Module has no attribute "per_channel_symmetric"
torch/quantization/observer.py:192: error: Name 'Tensor' is not defined
torch/quantization/observer.py:192: error: Name 'Tuple' is not defined
torch/quantization/observer.py:192: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/quantization/observer.py:233: error: Module has no attribute "per_tensor_symmetric"
torch/quantization/observer.py:233: error: Module has no attribute "per_channel_symmetric"
torch/quantization/observer.py:534: error: Name 'Tensor' is not defined
torch/quantization/observer.py:885: error: Name 'Tensor' is not defined
torch/quantization/observer.py:885: error: Name 'Tuple' is not defined
torch/quantization/observer.py:885: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/quantization/observer.py:894: error: Cannot determine type of 'max_val'
torch/quantization/observer.py:894: error: Cannot determine type of 'min_val'
torch/quantization/observer.py:899: error: Cannot determine type of 'min_val'
torch/quantization/observer.py:902: error: Name 'Tensor' is not defined
torch/quantization/observer.py:925: error: Name 'Tensor' is not defined
torch/quantization/observer.py:928: error: Cannot determine type of 'min_val'
torch/quantization/observer.py:929: error: Cannot determine type of 'max_val'
torch/quantization/observer.py:946: error: Argument "min" to "histc" has incompatible type "Tuple[Tensor, Tensor]"; expected "Union[int, float, bool]"
torch/quantization/observer.py:946: error: Argument "max" to "histc" has incompatible type "Tuple[Tensor, Tensor]"; expected "Union[int, float, bool]"
torch/quantization/observer.py:1056: error: Module has no attribute "per_tensor_symmetric"
torch/quantization/observer.py:1058: error: Module has no attribute "per_channel_symmetric"
torch/nn/quantized/functional.py:76: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:76: error: Name 'BroadcastingList2' is not defined
torch/nn/quantized/functional.py:259: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:259: error: Name 'Optional' is not defined
torch/nn/quantized/functional.py:259: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/nn/quantized/functional.py:289: error: Module has no attribute "ops"
torch/nn/quantized/functional.py:290: error: Module has no attribute "ops"
torch/nn/quantized/functional.py:308: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:326: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:356: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:371: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:400: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:400: error: Name 'Optional' is not defined
torch/nn/quantized/functional.py:400: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/nn/quantized/functional.py:430: error: Name 'Tensor' is not defined
torch/nn/quantized/functional.py:448: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/linear.py:26: error: Module has no attribute "ops"
torch/nn/quantized/modules/linear.py:28: error: Module has no attribute "ops"
torch/nn/quantized/modules/functional_modules.py:40: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:47: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:54: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:61: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:68: error: Name 'List' is not defined
torch/nn/quantized/modules/functional_modules.py:68: note: Did you forget to import it from "typing"? (Suggestion: "from typing import List")
torch/nn/quantized/modules/functional_modules.py:68: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:75: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:140: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:146: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:151: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:157: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:162: error: Name 'List' is not defined
torch/nn/quantized/modules/functional_modules.py:162: note: Did you forget to import it from "typing"? (Suggestion: "from typing import List")
torch/nn/quantized/modules/functional_modules.py:162: error: Name 'Tensor' is not defined
torch/nn/quantized/modules/functional_modules.py:168: error: Name 'Tensor' is not defined
torch/multiprocessing/spawn.py:9: error: Module 'torch.multiprocessing' has no attribute '_prctl_pr_set_pdeathsig'
torch/multiprocessing/__init__.py:28: error: Module has no attribute "__all__"
torch/jit/frontend.py:9: error: Cannot find implementation or library stub for module named 'torch._C._jit_tree_views'
torch/jit/annotations.py:6: error: Module 'torch._jit_internal' has no attribute 'BroadcastingList2'; maybe "BroadcastingList1" or "BroadcastingListCls"?
torch/jit/annotations.py:6: error: Module 'torch._jit_internal' has no attribute 'BroadcastingList3'; maybe "BroadcastingList1" or "BroadcastingListCls"?
torch/jit/annotations.py:9: error: Cannot find implementation or library stub for module named 'torch._C'
torch/distributions/distribution.py:16: error: Need type annotation for 'arg_constraints' (hint: "arg_constraints: Dict[<type>, <type>] = ...")
torch/distributions/distribution.py:74: error: Name 'arg_constraints' already defined on line 16
torch/distributions/distribution.py:84: error: Name 'support' already defined on line 15
torch/functional.py:114: error: Name 'Tuple' is not defined
torch/functional.py:114: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/functional.py:114: error: Name 'Optional' is not defined
torch/functional.py:114: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:189: error: Incompatible types in assignment (expression has type "None", variable has type "Tensor")
torch/functional.py:200: error: Argument 1 to "_indices_product" has incompatible type "Tuple[int, ...]"; expected "List[int]"
torch/functional.py:204: error: No overload variant of "__setitem__" of "list" matches argument types "Tensor", "int"
torch/functional.py:204: note: Possible overload variants:
torch/functional.py:204: note:     def __setitem__(self, int, int) -> None
torch/functional.py:204: note:     def __setitem__(self, slice, Iterable[int]) -> None
torch/functional.py:204: error: No overload variant of "__getitem__" of "list" matches argument type "Tensor"
torch/functional.py:204: note:     def __getitem__(self, int) -> int
torch/functional.py:204: note:     def __getitem__(self, slice) -> List[int]
torch/functional.py:207: error: "Tensor" has no attribute "copy_"
torch/functional.py:212: error: No overload variant of "__setitem__" of "list" matches argument types "Tensor", "int"
torch/functional.py:212: note: Possible overload variants:
torch/functional.py:212: note:     def __setitem__(self, int, int) -> None
torch/functional.py:212: note:     def __setitem__(self, slice, Iterable[int]) -> None
torch/functional.py:212: error: No overload variant of "__getitem__" of "list" matches argument type "Tensor"
torch/functional.py:212: note:     def __getitem__(self, int) -> int
torch/functional.py:212: note:     def __getitem__(self, slice) -> List[int]
torch/functional.py:215: error: Incompatible types in assignment (expression has type "None", variable has type "Tensor")
torch/functional.py:334: error: Name 'Optional' is not defined
torch/functional.py:334: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:429: error: Argument 2 to "pad" has incompatible type "Tuple[int, int]"; expected "List[int]"
torch/functional.py:431: error: Module has no attribute "stft"
torch/functional.py:766: error: Module has no attribute "cdist"
torch/functional.py:768: error: Module has no attribute "cdist"
torch/functional.py:770: error: Module has no attribute "cdist"
torch/functional.py:775: error: Name 'Optional' is not defined
torch/functional.py:775: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:780: error: Name 'Optional' is not defined
torch/functional.py:780: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:780: error: Name 'number' is not defined
torch/functional.py:780: error: Name 'norm' already defined on line 775
torch/functional.py:785: error: Name 'Optional' is not defined
torch/functional.py:785: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:785: error: Name 'number' is not defined
torch/functional.py:785: error: Name 'norm' already defined on line 775
torch/functional.py:790: error: Name 'Optional' is not defined
torch/functional.py:790: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:790: error: Name 'norm' already defined on line 775
torch/functional.py:795: error: Name 'norm' already defined on line 775
torch/functional.py:960: error: Name 'Any' is not defined
torch/functional.py:960: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Any")
torch/functional.py:960: error: Name 'Tuple' is not defined
torch/functional.py:960: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/functional.py:1036: error: Argument 1 to "len" has incompatible type "int"; expected "Sized"
torch/functional.py:1041: error: Name 'Optional' is not defined
torch/functional.py:1041: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:1041: error: Name 'Tuple' is not defined
torch/functional.py:1041: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/functional.py:1056: error: Name 'Optional' is not defined
torch/functional.py:1056: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/functional.py:1056: error: Name 'Tuple' is not defined
torch/functional.py:1056: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Tuple")
torch/distributions/von_mises.py:87: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/negative_binomial.py:25: error: Incompatible types in assignment (expression has type "_IntegerGreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/multivariate_normal.py:116: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/laplace.py:23: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/independent.py:34: error: Need type annotation for 'arg_constraints' (hint: "arg_constraints: Dict[<type>, <type>] = ...")
torch/distributions/cauchy.py:28: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/poisson.py:28: error: Incompatible types in assignment (expression has type "_IntegerGreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/one_hot_categorical.py:32: error: Incompatible types in assignment (expression has type "_Simplex", base class "Distribution" defined the type as "None")
torch/distributions/normal.py:27: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/lowrank_multivariate_normal.py:79: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/gamma.py:30: error: Incompatible types in assignment (expression has type "_GreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/exponential.py:23: error: Incompatible types in assignment (expression has type "_GreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/fishersnedecor.py:25: error: Incompatible types in assignment (expression has type "_GreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/dirichlet.py:44: error: Incompatible types in assignment (expression has type "_Simplex", base class "Distribution" defined the type as "None")
torch/nn/quantized/dynamic/modules/rnn.py:230: error: Incompatible types in assignment (expression has type "int", variable has type "Tensor")
torch/nn/quantized/dynamic/modules/rnn.py:232: error: Incompatible types in assignment (expression has type "int", variable has type "Tensor")
torch/nn/quantized/dynamic/modules/rnn.py:236: error: Incompatible return value type (got "Tuple[Any, Tensor, Any]", expected "Tuple[int, int, int]")
torch/nn/quantized/dynamic/modules/rnn.py:351: error: Incompatible types in assignment (expression has type "Type[LSTM]", base class "RNNBase" defined the type as "Type[RNNBase]")
torch/nn/quantized/dynamic/modules/rnn.py:381: error: Module has no attribute "quantized_lstm"
torch/nn/quantized/dynamic/modules/rnn.py:385: error: Module has no attribute "quantized_lstm"
torch/nn/quantized/dynamic/modules/rnn.py:414: error: Argument 1 to "forward_impl" of "LSTM" has incompatible type "PackedSequence"; expected "Tensor"
torch/nn/quantized/dynamic/modules/rnn.py:416: error: Incompatible types in assignment (expression has type "PackedSequence", variable has type "Tensor")
torch/nn/quantized/dynamic/modules/rnn.py:418: error: Incompatible return value type (got "Tuple[Tensor, Tuple[Tensor, Tensor]]", expected "Tuple[PackedSequence, Tuple[Tensor, Tensor]]")
torch/nn/quantized/dynamic/modules/rnn.py:420: error: Argument 1 of "permute_hidden" is incompatible with supertype "RNNBase"; supertype defines the argument type as "Tensor"
torch/nn/quantized/dynamic/modules/rnn.py:420: error: Return type "Tuple[Tensor, Tensor]" of "permute_hidden" incompatible with return type "Tensor" in supertype "RNNBase"
torch/nn/quantized/dynamic/modules/rnn.py:426: error: Argument 2 of "check_forward_args" is incompatible with supertype "RNNBase"; supertype defines the argument type as "Tensor"
torch/nn/intrinsic/qat/modules/conv_fused.py:232: error: Incompatible types in assignment (expression has type "Type[ConvBnReLU2d]", base class "ConvBn2d" defined the type as "Type[ConvBn2d]")
torch/distributions/beta.py:27: error: Incompatible types in assignment (expression has type "_Interval", base class "Distribution" defined the type as "None")
torch/distributions/geometric.py:31: error: Incompatible types in assignment (expression has type "_IntegerGreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/continuous_bernoulli.py:38: error: Incompatible types in assignment (expression has type "_Interval", base class "Distribution" defined the type as "None")
torch/distributions/bernoulli.py:30: error: Incompatible types in assignment (expression has type "_Boolean", base class "Distribution" defined the type as "None")
torch/quantization/fake_quantize.py:126: error: Module has no attribute "per_tensor_symmetric"
torch/quantization/fake_quantize.py:132: error: Module has no attribute "per_channel_symmetric"
torch/distributions/transformed_distribution.py:41: error: Need type annotation for 'arg_constraints' (hint: "arg_constraints: Dict[<type>, <type>] = ...")
torch/jit/__init__.py:1: error: Cannot find implementation or library stub for module named 'torch._C'
torch/jit/__init__.py:15: error: Module 'torch.utils' has no attribute 'set_module'
torch/jit/__init__.py:70: error: Name 'Attribute' already defined on line 68
torch/jit/__init__.py:213: error: On Python 3 '{}'.format(b'abc') produces "b'abc'"; use !r if this is a desired behavior
torch/jit/__init__.py:215: error: On Python 3 '{}'.format(b'abc') produces "b'abc'"; use !r if this is a desired behavior
torch/jit/__init__.py:1524: error: Unsupported dynamic base class "with_metaclass"
torch/jit/__init__.py:1869: error: Name 'ScriptModule' already defined on line 1524
torch/jit/__init__.py:1998: error: Need type annotation for '_jit_caching_layer'
torch/jit/__init__.py:1999: error: Need type annotation for '_jit_function_overload_caching'
torch/distributions/relaxed_categorical.py:34: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/relaxed_categorical.py:108: error: Incompatible types in assignment (expression has type "_Simplex", base class "Distribution" defined the type as "None")
torch/distributions/relaxed_bernoulli.py:31: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/relaxed_bernoulli.py:114: error: Incompatible types in assignment (expression has type "_Interval", base class "Distribution" defined the type as "None")
torch/distributions/logistic_normal.py:31: error: Incompatible types in assignment (expression has type "_Simplex", base class "Distribution" defined the type as "None")
torch/distributions/log_normal.py:26: error: Incompatible types in assignment (expression has type "_GreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/half_normal.py:27: error: Incompatible types in assignment (expression has type "_GreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/half_cauchy.py:28: error: Incompatible types in assignment (expression has type "_GreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/gumbel.py:28: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/nn/quantized/modules/conv.py:18: error: Module 'torch.nn.utils' has no attribute 'fuse_conv_bn_weights'
torch/nn/quantized/modules/conv.py:209: error: Name 'Optional' is not defined
torch/nn/quantized/modules/conv.py:209: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/nn/quantized/modules/conv.py:214: error: Module has no attribute "ops"
torch/nn/quantized/modules/conv.py:321: error: Name 'Optional' is not defined
torch/nn/quantized/modules/conv.py:321: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/nn/quantized/modules/conv.py:323: error: Module has no attribute "ops"
torch/nn/quantized/modules/conv.py:447: error: Name 'Optional' is not defined
torch/nn/quantized/modules/conv.py:447: note: Did you forget to import it from "typing"? (Suggestion: "from typing import Optional")
torch/nn/quantized/modules/conv.py:449: error: Module has no attribute "ops"
torch/nn/quantized/modules/conv.py:513: error: Name 'nn.modules.conv._ConvTransposeNd' is not defined
torch/nn/quantized/modules/conv.py:525: error: Name 'List' is not defined
torch/nn/quantized/modules/conv.py:525: note: Did you forget to import it from "typing"? (Suggestion: "from typing import List")
torch/nn/quantized/modules/conv.py:527: error: Name 'List' is not defined
torch/nn/quantized/modules/conv.py:527: note: Did you forget to import it from "typing"? (Suggestion: "from typing import List")
torch/nn/intrinsic/quantized/modules/conv_relu.py:8: error: Module 'torch.nn.utils' has no attribute 'fuse_conv_bn_weights'
torch/nn/intrinsic/quantized/modules/conv_relu.py:21: error: Incompatible types in assignment (expression has type "Type[ConvReLU2d]", base class "Conv2d" defined the type as "Type[Conv2d]")
torch/nn/intrinsic/quantized/modules/conv_relu.py:62: error: Incompatible types in assignment (expression has type "Type[ConvReLU3d]", base class "Conv3d" defined the type as "Type[Conv3d]")
torch/distributions/weibull.py:25: error: Incompatible types in assignment (expression has type "_GreaterThan", base class "Distribution" defined the type as "None")
torch/distributions/kl.py:35: error: Need type annotation for '_KL_MEMOIZE' (hint: "_KL_MEMOIZE: Dict[<type>, <type>] = ...")
torch/distributions/studentT.py:27: error: Incompatible types in assignment (expression has type "_Real", base class "Distribution" defined the type as "None")
torch/distributions/mixture_same_family.py:48: error: Need type annotation for 'arg_constraints' (hint: "arg_constraints: Dict[<type>, <type>] = ...")
torch/distributions/__init__.py:158: error: Name 'transforms' is not defined
torch/onnx/utils.py:21: error: Cannot find implementation or library stub for module named 'torch._C'
torch/distributed/rendezvous.py:4: error: Cannot find implementation or library stub for module named 'urlparse'
torch/distributed/rendezvous.py:4: error: Name 'urlparse' already defined (possibly by an import)
torch/distributed/rendezvous.py:4: error: Name 'urlunparse' already defined (possibly by an import)
torch/distributed/rendezvous.py:9: error: Module 'torch.distributed' has no attribute 'FileStore'
torch/distributed/rendezvous.py:9: error: Module 'torch.distributed' has no attribute 'TCPStore'
torch/distributed/rendezvous.py:65: error: On Python 3 '{}'.format(b'abc') produces "b'abc'"; use !r if this is a desired behavior
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'AllreduceOptions'; maybe "ReduceOptions" or "AllreduceCoalescedOptions"?
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'AllreduceCoalescedOptions'; maybe "AllreduceOptions"?
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'AllToAllOptions'
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'BroadcastOptions'
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'GatherOptions'; maybe "ScatterOptions"?
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'ReduceOptions'; maybe "AllreduceOptions", "ReduceScatterOptions", or "ReduceOp"?
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'ReduceScatterOptions'; maybe "ScatterOptions" or "ReduceOptions"?
torch/distributed/distributed_c10d.py:11: error: Module 'torch.distributed' has no attribute 'ScatterOptions'; maybe "ReduceScatterOptions" or
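
By far the most frequent complaint in this log is a `typing` name used in an annotation without being imported. A hedged, hypothetical illustration of the recurring fix (the function below is made up; the real offenders are signatures in files like `torch/_lowrank.py`):

```python
from typing import Optional, Tuple  # without this import, mypy reports
                                    # "Name 'Optional' is not defined"
from torch import Tensor

def svd_sketch(a: Tensor, q: Optional[int] = None) -> Tuple[Tensor, Tensor]:
    # mypy's suggestion for the unimported names is exactly the note seen
    # above: Did you forget to import it from "typing"?
    ...
```
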
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36584

Reviewed By: seemethere, ailzhang

Differential Revision: D21155985

Pulled By: ezyang

fbshipit-source-id: f628d4293992576207167e7c417998fad15898d1
2020-04-22 14:17:08 -07:00
e921cd222a Move bulky constants from SobolEngineOpsUtil.h to .cpp file (#37086)
Summary:
Also move statics in the global namespace to inlines in the `at::native::sobol_utils` namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37086

Test Plan:
CI as well as build with Xcode 11.3
Smoke-tested perf, compiled using `cmake ../pytorch -DPYTHON_EXECUTABLE=/usr/bin/python3.7 -DUSE_CUDA=NO -DBUILD_TEST=YES -DCMAKE_CXX_COMPILER=/usr/bin/cuda-g++ -DCMAKE_C_COMPILER=/usr/bin/cuda-gcc -DUSE_MKLDNN=ON -G Ninja` and run on an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz running FC-30:

Before:
```
$ python3.7 -m timeit -s 'from torch.quasirandom import SobolEngine; sobol = SobolEngine(65, True, 18)' 'sobol.draw(11)'
50000 loops, best of 5: 7.99 usec per loop
```
After:
```
$ python3.7 -m timeit -s 'from torch.quasirandom import SobolEngine; sobol = SobolEngine(65, True, 18)' 'sobol.draw(11)'
50000 loops, best of 5: 7.72 usec per loop
```

Differential Revision: D21182866

Pulled By: malfet

fbshipit-source-id: d3e501ccb9ffbe6395c1598a6f79f2f2f1f37ee0
2020-04-22 14:10:21 -07:00
5fc391a646 Enforce type promotion in torch.cat (#35030)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35014

The CUDA `cat` implementation doesn't use `TensorIterator`, so some checks need to be done manually in the code.
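
For illustration, a minimal sketch of the promotion this change enforces (assuming the standard `torch.result_type` rules):

```python
import torch

# With type promotion enforced, concatenating mixed dtypes promotes the
# result (here float32 + float64 -> float64) instead of failing or
# silently truncating, on CPU and CUDA alike.
a = torch.zeros(3, dtype=torch.float32)
b = torch.ones(3, dtype=torch.float64)
print(torch.cat([a, b]).dtype)  # torch.float64
```
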
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35030

Differential Revision: D21155853

Pulled By: nairbv

fbshipit-source-id: 9e78bb7591f806734e12555831157061c925ff40
2020-04-22 13:35:07 -07:00
73bffeff62 scripts: Distinguish between platforms in conda promote (#37089)
Summary:
Files with the same name within the anaconda repository (e.g.
pytorch_1.5.0-cpu.bz2) were found to be clobbering each other,
especially across different platforms.

This led to similarly named packages for different platforms not getting
promoted.

This also adds "--skip" to our anaconda upload so that we don't end up
overwriting our releases just in case this script gets run twice.

Also, conda search errors out if it doesn't find anything for the
platform being searched, so we should just continue when nothing is
found: we want to be able to use this script for all of the packages we
support, and not all of them release packages for the same platforms
(torchtext, for example, only has "noarch").

This should also probably be back-ported to the `release/1.5` branch since this changeset was used to release `v1.5.0`

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37089

Differential Revision: D21184768

Pulled By: seemethere

fbshipit-source-id: dbe12d74df593b57405b178ddb2375691e128a49
2020-04-22 13:29:56 -07:00
76cb7f2043 Use filelist from build_variables.bzl to fetch distributed file list (#37090)
Summary:
Rename `get_filelist` to `append_filelist`.
Replace the hardcoded file list under `USE_DISTRIBUTED` with an `append_filelist("libtorch_distributed_sources" TORCH_SRCS)` call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37090

Test Plan: CI

Differential Revision: D21184002

Pulled By: malfet

fbshipit-source-id: 25bb7f97fcb2bf5bec8bdb3aa059ae13e7610007
2020-04-22 13:13:25 -07:00
7bd2014eec [resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650

Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).

This PR adds an optional argument to `rpc_sync` and `rpc_async`, as well as `RpcAgent::send()`, that allows the user to specify a timeout for an RPC that overrides the default. If the user does not specify this argument, then the default RPC timeout set in the RPC constructor or by `rpc.set_rpc_timeout()` is used. Otherwise, we use the passed-in timeout.
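
A hedged sketch of the per-call override (assuming the float-seconds form of the `timeout` argument; as noted below, its exact type was still under discussion when this landed):

```python
import torch
import torch.distributed.rpc as rpc

# Run one such process per worker; this is worker0's side. The 5-second
# timeout applies to this one call only; other RPCs keep the agent-wide
# default set at init time or via rpc.set_rpc_timeout().
rpc.init_rpc("worker0", rank=0, world_size=2)
fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2), 3), timeout=5.0)
print(fut.wait())  # raises if worker1 doesn't respond within 5 seconds
rpc.shutdown()
```
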

This diff does not address:
1) timeout support when rpc.rpc_async is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass this timeout to `rpcTorchscript`. Another issue is that torchscript doesn't support the timedelta object. This will be done in a follow-up PR, as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter by raising the timeout error to the user at the earliest possible opportunity, such as the next time the RRef is forked or `to_here` is called.

Added unit tests to confirm the current behavior
ghstack-source-id: 102622601

Test Plan: Added unit tests in rpc_test

Differential Revision: D20376953

fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
2020-04-22 13:00:42 -07:00
b0ee6c70aa Remove register_mobile_ops.cpp (#37035)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37035

Test Plan: Imported from OSS

Differential Revision: D21168316

Pulled By: iseeyuan

fbshipit-source-id: 11a534c94b2a8f75e0e01d56810a968e62a3b706
2020-04-22 12:54:49 -07:00
8a6ab004f7 Dockerfile: Update miniconda installer download location & remove unnecessary flag (#37082)
Summary:
https://repo.continuum.io/ was permanently moved to https://repo.anaconda.com
There is no need to specify `-O` since we already pass `-o`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37082

Differential Revision: D21182390

Pulled By: malfet

fbshipit-source-id: eeec70a883cbfd14105abd1ac6685f66afc02c02
2020-04-22 12:26:33 -07:00
5710f278a1 ci: Change file_diff_from_base to be dynamic (#36260)
Summary:
Changes the file_diff_from_base function to get the base reference
directly from CircleCI's pipeline variables instead of being hardcoded
to master.

cc gchanan
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36260

Differential Revision: D21144940

Pulled By: seemethere

fbshipit-source-id: ec6d1c2adcf703119bdab2a43f26a39a5fbaf71b
2020-04-22 11:51:27 -07:00
cf77e56938 clang-format don't run on master (#37058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37058

We shouldn't add advisory checks to master, because PRs will get
reverted if they fail. This PR makes the following changes:

1. Factor out the binary fetch logic into `clang_format_utils.py`
2. Copypasta the canonical git integration from llvm and modify it to
use our binary fetcher. No more bikeshedding about how to integrate,
we just use the standard integration.
3. Change the CI job to run on pull-requests only and use
`git-clang-format`.
4. The original `clang_format.py` is now renamed `clang_format_all.py`
to reflect its purpose.
5. The pre-commit hook has been changed to use `git-clang-format`.

For pre-commit hook users: no changes required.
For others: add `tools/git-clang-format` to your PATH and you can do `git clang-format` to format your working tree.

Test Plan: Imported from OSS

Differential Revision: D21180893

Pulled By: suo

fbshipit-source-id: f8358fb7ce26f11585226aaac5ed89d379257bfb
2020-04-22 11:37:22 -07:00
171476e870 CUDA implementation of Sparse Adagrad Fusion for GPUs (#35762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35762

We implement the following operators for Regular and RowWise SparseAdagrad fusion with SLS and SLWS gradients:
- SparseAdagradFusedWithSparseLengthsSumGradient
- RowWiseSparseAdagradFusedWithSparseLengthsSumGradient
- SparseAdagradFusedWithSparseLengthsWeightedSumGradient
- RowWiseSparseAdagradFusedWithSparseLengthsWeightedSumGradient
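
As a rough Python sketch (not the fused CUDA kernels) of the row-wise variant these operators implement, assuming the standard row-wise Adagrad formulation: one accumulated moment per embedding row rather than one per parameter:

```python
import numpy as np

def rowwise_sparse_adagrad(param, moment, indices, grads, lr=0.01, eps=1e-6):
    # indices selects the embedding rows touched by this minibatch;
    # grads holds one gradient row per index.
    for idx, g in zip(indices, grads):
        moment[idx] += np.mean(g * g)            # one scalar moment per row
        param[idx] -= lr * g / (np.sqrt(moment[idx]) + eps)
```
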

Test Plan:
- SparseAdagradFusedWithSparseLengthsSumGradient
- RowWiseSparseAdagradFusedWithSparseLengthsSumGradient

```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

- SparseAdagradFusedWithSparseLengthsWeightedSumGradient
- RowWiseSparseAdagradFusedWithSparseLengthsWeightedSumGradient

```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_weighted_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

Benchmark code:
```
buck run mode/dev-nosan //caffe2/caffe2/fb/optimizers:adagrad_fused_bench_gpu
```

Reviewed By: jspark1105

Differential Revision: D20453096

fbshipit-source-id: bc209348232e3454af0d1d909bbd8ab7f07f69fd
2020-04-22 11:30:20 -07:00
3580c93716 [autograd] Demote the dist container shard line to VLOG(1) (#36978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36978

We're seeing quite a few of these when running unit tests; they can be a
bit verbose at LOG(INFO).
ghstack-source-id: 102557335

Test Plan: regular unittest coverage, this is logging-only

Differential Revision: D21149262

fbshipit-source-id: 4992342883920f58484afd8b1e432c1455035835
2020-04-22 10:48:28 -07:00
9b0e7ebab0 [iOS] 1.5.0 Cocoapods Release (#37039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37039

### Summary

Cocoapods 1.5.0 release. Binary has been pushed to AWS - https://ossci-ios.s3.amazonaws.com/libtorch_ios_1.5.0.zip

### Test Plan

- `pod spec lint --verbose --allow-warnings --no-clean --use-libraries --skip-import-validation`
- TestApp

Test Plan: Imported from OSS

Differential Revision: D21169113

Pulled By: xta0

fbshipit-source-id: d015c218ed20b168a1ef21025db8880da4e3b074
2020-04-22 10:36:18 -07:00
a00d6758b8 Migrate cosh and cosh_ from TH to ATen (CUDA) (#36654)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24546

Benchmarked with the same build settings on the same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cosh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cosh(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2813017509997735
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.28355878599904827
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.27810572300040803
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.3239932899996347
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.321233343998756
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5546665399997437
```

After:

```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2905335750001541
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.27596429500044906
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.30358699899989006
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.30139567500009434
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.30246640400036995
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5403946970000106

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36654

Differential Revision: D21164606

Pulled By: VitalyFedyunin

fbshipit-source-id: 55e88f94044957f81599ae3c12cda38a3e2c985c
2020-04-22 10:16:24 -07:00
e7a72bb0c6 Add nomnigraph include folder to Caffe2_GPU_INCLUDE (#37056)
Summary:
Because `caffe2/contrib/tensorrt` includes nomnigraph headers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37056

Test Plan: `cmake ../pytorch -DPYTHON_EXECUTABLE=/usr/bin/python3.7 -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DUSE_CUDA=YES -DBUILD_TEST=YES -DUSE_TENSORRT=YES -DTENSORRT_ROOT=$HOME/Downloads/TensorRT-7.0.0.11 -DCMAKE_CXX_COMPILER=/usr/bin/cuda-g++ -DCMAKE_C_COMPILER=/usr/bin/cuda-gcc -DUSE_MKLDNN=ON -G Ninja; ninja torch_cuda`

Differential Revision: D21178927

Pulled By: malfet

fbshipit-source-id: e1bed94fdb395ebfd6eb5d950ca378da77592531
2020-04-22 09:44:13 -07:00
7c9e7ef128 Revert D21171747: [pytorch][PR] Rebase xla job on top master before running CI build.
Test Plan: revert-hammer

Differential Revision:
D21171747

Original commit changeset: 433ea0e14d03

fbshipit-source-id: 6d5538a3533356997077bb1b8cd46aa6ec4332f8
2020-04-22 09:27:25 -07:00
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace changes might be helpful).
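
A representative (hypothetical) before/after of the kind of cruft removed, assuming the usual Python 2 compatibility patterns:

```python
# Before (Python 2 compatible):
#     from __future__ import absolute_import, division, print_function
#     class Module(object):
#         def __init__(self):
#             super(Module, self).__init__()
#
# After (Python 3 only):
class Module:
    def __init__(self):
        super().__init__()
```
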

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
a894fff265 Back out "Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API"
Summary: Original commit changeset: 636e8a11afc6

Test Plan: export to OSS

Reviewed By: malfet

Differential Revision: D21170502

fbshipit-source-id: e8f35f103c4924aedbcaaf868475008d24bdeeab
2020-04-22 09:18:23 -07:00
3b832ee2bf Use Python3 super() throughout torch.testing. (#37024)
Summary:
Hattip to ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37024

Differential Revision: D21173244

Pulled By: malfet

fbshipit-source-id: 7079703e28777d873f69bf9fd4dcbad8d53a2682
2020-04-22 09:00:28 -07:00
28fadfc4eb Reduce overheads on several CPU kernels by avoiding restrides. (#36875)
Summary:
Calling `t.as_strided(..., ...)` must make a new `TensorImpl` to back the new tensor, which takes 300-400 ns. Reduction, scatter/gather, and comparison kernels currently restride inputs and outputs in order to handle `dim` inside the function passed to TensorIterator. Because these Tensors are created solely for consumption by the iterator a full restride and metadata copy is surplus to requirements. Moreover, shapes are already checked by these kernels prior to calling `add_input` and `add_output`, so shape inference and broadcasting are also unnecessary.
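
The restride cost is easy to observe directly; a small, machine-dependent sketch (the 300-400 ns figure above came from the author's setup):

```python
import timeit
import torch

t = torch.randn(8, 8)
# Each call allocates a fresh TensorImpl for the restrided view.
total = timeit.timeit(lambda: t.as_strided((8, 8), (8, 1)), number=100_000)
print(f"{total / 100_000 * 1e9:.0f} ns per as_strided call")
```
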

This PR adds a `TensorIterator::declare_static_shape(...)` method, which allows certain kernels to use a much more constrained and efficient shape path. This results in a 900-1200 ns speedup for `gather / scatter / scatter_add / cumsum / cumprod` and a 250-500 ns speedup for elementwise `min` and `max`.

Measurements were taken with [this python script](https://gist.github.com/robieta/51ac5db2f9c7e812d5ff264403ce4f92), which is driven by [this bash script](https://gist.github.com/robieta/1420e917cf38885de3093f8c3a7bd437). The general procedure for mitigating environmental skew is to repeatedly switch between an environment which is built with master and one which is built with this branch while running the python script. Within the python measurement script the following was used to reduce variation:
* Set number of threads to 1
* Aggressively and randomly interleave task measurements to limit correlation between tasks and system state based on when they were run or what task preceded the current one.
* Warmup period, dropping the first three passes through all of the tasks.
Two independent end-to-end runs are included since there is some variation even with the above measures. Overall measurement error seems to be about +/- 100 ns.
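
The interleaving strategy above might look like the following sketch (task names and pass counts are illustrative):

```python
import random
import time

def run_interleaved(tasks, passes=10, warmup=3):
    # Shuffle the task order on every pass so that drift in system state
    # doesn't correlate with any particular task.
    results = {name: [] for name in tasks}
    for p in range(passes):
        order = list(tasks.items())
        random.shuffle(order)
        for name, fn in order:
            start = time.perf_counter_ns()
            fn()
            elapsed = time.perf_counter_ns() - start
            if p >= warmup:  # drop the first few passes as warmup
                results[name].append(elapsed)
    return results
```
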

The benchmark also includes several tasks which are not affected by this PR, both to check for a degradation in TensorIterator performance when static shapes are not set (which did happen for an earlier iteration of this optimization) and to estimate measurement variability and validate that measured improvements are significant.

**First run**:
```
                          Delta (median)     Master     (25%,  75%)          Branch     (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     920      |    4,000     (-170, +230)       |  3,100     (-110, +140)
gather_dim0              |     910      |    4,100     (-170, +230)       |  3,200     (-110, +150)
gather_dim1              |   1,200      |    4,400     (-190, +240)       |  3,200     (-120, +150)
scatter_1D               |   1,100      |    2,800     (-120, +160)       |  1,700     (-64 , +81)
scatter_dim0             |   1,000      |    2,900     (-130, +160)       |  1,900     (-72 , +95)
scatter_dim1             |   1,200      |    3,200     (-130, +170)       |  1,900     (-67 , +87)
scatter_add_1D           |   1,100      |    2,800     (-120, +150)       |  1,700     (-68 , +89)
scatter_add_dim0         |   1,000      |    2,900     (-120, +150)       |  1,900     (-77 , +93)
scatter_add_dim1         |   1,300      |    3,100     (-140, +180)       |  1,900     (-76 , +92)
cumsum_1D                |   1,000      |    4,600     (-200, +260)       |  3,600     (-120, +170)
cumsum_dim0              |     860      |    4,500     (-190, +240)       |  3,700     (-140, +180)
cumsum_dim1              |   1,200      |    4,800     (-210, +260)       |  3,700     (-130, +180)
cumprod_1D               |   1,000      |    4,600     (-200, +270)       |  3,600     (-130, +170)
cumprod_dim0             |     910      |    4,600     (-210, +270)       |  3,700     (-130, +170)
cumprod_dim1             |   1,200      |    4,900     (-220, +290)       |  3,700     (-130, +170)
min_dim0                 |     280      |    5,900     (-220, +270)       |  5,600     (-220, +260)
min_dim1                 |     560      |    6,200     (-230, +310)       |  5,600     (-230, +270)
max_dim0                 |     320      |    5,900     (-220, +280)       |  5,600     (-200, +250)
max_dim1                 |     540      |    6,100     (-250, +310)       |  5,600     (-200, +250)
std       (reference)    |      58      |    4,300     (-180, +280)       |  4,200     (-160, +200)
clamp     (reference)    |      87      |    3,400     (-160, +220)       |  3,400     (-140, +170)
argmin    (reference)    |     -85      |    3,900     (-170, +250)       |  4,000     (-170, +200)
sum       (reference)    |     -11      |    4,200     (-180, +240)       |  4,200     (-160, +190)
x < y     (reference)    |     110      |    3,700     (-170, +290)       |  3,500     (-140, +150)
max(x, y) (reference)    |     170      |    3,600     (-170, +200)       |  3,400     (-140, +180)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.
```

**Second run:**
```
                          Delta (median)     Master     (25%,  75%)          Branch     (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     850      |    3,900     (-130, +150)       |  3,000     (-110, +130)
gather_dim0              |     860      |    4,000     (-140, +150)       |  3,200     (-110, +150)
gather_dim1              |   1,200      |    4,300     (-160, +160)       |  3,200     (-110, +150)
scatter_1D               |   1,100      |    2,700     (-98 , +110)       |  1,700     (-64 , +83)
scatter_dim0             |     950      |    2,800     (-100, +110)       |  1,900     (-67 , +88)
scatter_dim1             |   1,200      |    3,100     (-120, +140)       |  1,900     (-69 , +88)
scatter_add_1D           |   1,100      |    2,700     (-92 , +110)       |  1,700     (-65 , +95)
scatter_add_dim0         |     960      |    2,800     (-100, +100)       |  1,900     (-74 , +100)
scatter_add_dim1         |   1,200      |    3,100     (-100, +130)       |  1,900     (-72 , +100)
cumsum_1D                |     960      |    4,500     (-140, +190)       |  3,600     (-130, +170)
cumsum_dim0              |     820      |    4,500     (-140, +180)       |  3,700     (-130, +170)
cumsum_dim1              |   1,100      |    4,800     (-160, +200)       |  3,600     (-120, +170)
cumprod_1D               |     960      |    4,500     (-130, +190)       |  3,600     (-130, +180)
cumprod_dim0             |     820      |    4,500     (-150, +190)       |  3,700     (-130, +180)
cumprod_dim1             |   1,100      |    4,800     (-150, +220)       |  3,700     (-130, +180)
min_dim0                 |     260      |    5,800     (-210, +250)       |  5,500     (-200, +230)
min_dim1                 |     580      |    6,100     (-230, +270)       |  5,500     (-200, +220)
max_dim0                 |     250      |    5,800     (-210, +230)       |  5,600     (-170, +210)
max_dim1                 |     520      |    6,100     (-220, +240)       |  5,600     (-180, +210)
std       (reference)    |     170      |    4,300     (-210, +220)       |  4,100     (-160, +190)
clamp     (reference)    |     140      |    3,400     (-140, +170)       |  3,300     (-120, +170)
argmin    (reference)    |     -51      |    3,800     (-170, +190)       |  3,900     (-140, +160)
sum       (reference)    |     -58      |    4,100     (-160, +170)       |  4,200     (-170, +190)
x < y     (reference)    |      64      |    3,600     (-150, +210)       |  3,500     (-140, +180)
max(x, y) (reference)    |     120      |    3,500     (-130, +150)       |  3,400     (-130, +150)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.
```

CC ilia-cher VitalyFedyunin glaringlee gdankel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36875

Differential Revision: D21173011

Pulled By: robieta

fbshipit-source-id: 2067ab62f8f8d7b50e20a486a262864480699bbe
2020-04-22 08:58:53 -07:00
25eb250d77 Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#36747)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly, e.g.:
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR.

This PR also:
- adds CPU acc types for complex
- adds cumsum and cumprod for complex dtypes
- adds complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36747

Differential Revision: D21138687

Pulled By: anjali411

fbshipit-source-id: ad3602ccf86c70294a6e71e564cb0d46c393dfab
2020-04-22 08:52:41 -07:00
191fa528f5 Rebase xla job on top master before running CI build. (#36852)
Summary:
This PR tries to rebase on top of origin/master before building xla job.
I also saw a TODO in existing code which does a very similar thing (rebase on master for gcc5 jobs), so I just fixed the TODO by moving the logic into a separate step.
Currently the logic is:
For these gcc5 and xla jobs, we rebase on top of "target" branch before building.
- This only happens on PRs.
- "Target" branch is "origin/master" by default, but if it's trying to merge into a release branch, target branch will be the release branch.
- I made the "target" branch a param mainly it's allow us to rebase on `viable/strict` if we want. But after a second thought, how quickly `viable/strict` moves forward is not controlled only by xla job, and it's hard to predict how long the breakage will last if it's not moving. But we do have control over how long a xla breakage lasts on `origin/master` (which should be short since we monitor it closely). So I currently want to keep `origin/master` and move to `viable/strict` when it's super stable.
- There're jobs like `pytorch_paralleltbb_linux_xenial_py3_6_gcc5_4_build` which would fall into the rebase logic as well, but since those jobs doesn't run on PRs(so the old logic was essentially no-op), I didn't enabled the new logic on those jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36852

Differential Revision: D21171747

Pulled By: ailzhang

fbshipit-source-id: 433ea0e14d030e2e0fa74d2ff4244327e9db7044
2020-04-22 08:46:54 -07:00
3e3498cf03 [quant][graphmode] torch.clamp (#36887)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36887

Test Plan: Imported from OSS

Differential Revision: D21110469

Pulled By: z-a-f

fbshipit-source-id: 6829f5c315b8950a89364132bca953d78cd4ff3d
2020-04-22 04:20:41 -07:00
799793f279 [TensorExpr] Cleanup IRPrinter implementation for statements. (#37050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37050

With this change curly braces are printed as a part of Block rather than
a part of the enclosing statement. It allows us, for instance, to more
easily see nested blocks: now they will be printed each in its own
curly-braced scope.

As a side effect, I had to change how we print loop options. Previously
we did it like this:
```
for (...) { // <loop options>
  <loop body (Block)>
}
```

Now, since everything in between { and } is a part of the block, we have
to do it the following way:
```
for (...) /* <loop options> */ {
  <loop body (Block)>
}
```
Note the change from '//' to '/* .. */' for the loop option comments.

Test Plan: Imported from OSS

Differential Revision: D21171851

Pulled By: ZolotukhinM

fbshipit-source-id: 39f51a9e15aec03b6527b0634fd4b9e01a912cda
2020-04-21 23:20:18 -07:00
b8e2d797c0 [TensorExpr] Insert allocations for temporary buffer at the innermost valid scope. (#36836)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36836

Test Plan: Imported from OSS

Differential Revision: D21099913

Pulled By: ZolotukhinM

fbshipit-source-id: 8faf5f1d55b60bdd4f4b2b909977aeb7abaa95b4
2020-04-21 22:51:46 -07:00
6df90bcecc setup.py: Remove conflicting double documentation of USE_FBGEMM (#36993)
Summary:
Lines 33+ contain instructions on how to disable its use; lines 108+ describe how to enable it.
The default in CMakeLists.txt is enabled, so drop the latter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36993

Differential Revision: D21161793

Pulled By: ngimel

fbshipit-source-id: 08c5eecaf8768491f90d4a52c338ecea32a0c35e
2020-04-21 22:33:49 -07:00
4593d87b84 Do not link torch_python with nccl (#37040)
Summary:
If NCCL is used, just allow `torch_python` to access enums defined in the header file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37040

Test Plan: `cmake ../pytorch -DPYTHON_EXECUTABLE=/usr/bin/python3.7 -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DUSE_CUDA=YES -DBUILD_TEST=YES -DUSE_NCCL=YES -DUSE_DISTRIBUTED=NO -DCMAKE_CXX_COMPILER=/usr/bin/cuda-g++ -DCMAKE_C_COMPILER=/usr/bin/cuda-gcc -DUSE_MKLDNN=ON -G Ninja` + `ninja torch_python`

Differential Revision: D21171573

Pulled By: malfet

fbshipit-source-id: e5eba0f610da3b0fcd17342ad46458dc7b0d251b
2020-04-21 21:00:49 -07:00
4bbc49f53a Revert D21143025: [reland][quant] QuantizedCUDA implementation
Test Plan: revert-hammer

Differential Revision:
D21143025

Original commit changeset: 11405e2e8f87

fbshipit-source-id: ce471ec95c1fc6abff6d1bbdba11bef02f3a0d62
2020-04-21 20:36:12 -07:00
7b03ce7bb3 make sure logs work inside aten/c10 namespaces as well (#37018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37018

Differential Revision: D21167933

Pulled By: Krovatkin

fbshipit-source-id: b63f890c19b9887d5709b308ef691f5061cd27b8
2020-04-21 20:01:13 -07:00
4a2372bc90 Implements torch.isclose for complex tensors (#36456)
Summary:
Previously torch.isclose would raise a RuntimeError when called on complex tensors. This PR updates torch.isclose to run on complex tensors and be consistent with [NumPy](https://numpy.org/doc/1.18/reference/generated/numpy.isclose.html). However, NumPy's handling of NaN, -inf, and inf values is odd, so I adopted Python's [cmath.isclose](https://docs.python.org/3/library/cmath.html) behavior when dealing with them. See https://github.com/numpy/numpy/issues/15959 for more on NumPy's behavior.

While implementing complex isclose I also simplified the isclose algorithm to:

- A is close to B if A and B are equal, if equal_nan is true then NaN is equal to NaN
- If A and B are finite, then A is close to B if `abs(a - b) <= (atol + abs(rtol * b))`
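A minimal scalar sketch of the simplified algorithm above (torch.isclose itself operates elementwise on tensors; the defaults shown are assumptions for illustration):
```python
import math

def isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False):
    if a == b:  # equal values, including matching infinities, are close
        return True
    if math.isnan(a) or math.isnan(b):
        return equal_nan and math.isnan(a) and math.isnan(b)
    if math.isinf(a) or math.isinf(b):  # mismatched infinities are never close
        return False
    # finite case: compare against the combined absolute/relative tolerance
    return abs(a - b) <= atol + abs(rtol * b)
```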

This PR also documents torch.isclose, since it was undocumented, and adds multiple tests for its behavior to test_torch.py since it had no dedicated tests.

The PR leaves equal_nan=True with complex inputs an error for now, pending the outcome of https://github.com/numpy/numpy/issues/15959.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36456

Differential Revision: D21159853

Pulled By: mruberry

fbshipit-source-id: fb18fa7048e6104cc24f5ce308fdfb0ba5e4bb30
2020-04-21 19:53:55 -07:00
5c2b273089 Add RRef Python Helper to launch function on the referenced object (#36619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36619

With this PR, applications no longer need to create dedicated helpers
to run functions on the object referenced by an RRef. Instead,
`rref.rpc_sync().some_func()` will use `rpc_sync` to run `some_func`
on the owner of the RRef using the object referenced by the RRef.
Similar helpers for `rref.rpc_async().some_func()` and
`rref.remote().some_func()` are also added.
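A rough usage sketch of the new helpers (the worker name and remote module are hypothetical; assumes RPC has been initialized):
```python
import torch
import torch.distributed.rpc as rpc

# create an nn.Linear owned by "worker1" and get back an RRef to it
linear_rref = rpc.remote("worker1", torch.nn.Linear, args=(4, 4))

out = linear_rref.rpc_sync().forward(torch.ones(4))      # blocking RPC
fut = linear_rref.rpc_async().forward(torch.ones(4))     # returns a Future
res_rref = linear_rref.remote().forward(torch.ones(4))   # returns an RRef
```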

An alternative design is to expose PyRRef as RRefBase and then
implement everything in a new Python RRef class. However, the RRef
class cannot directly inherit from PyRRef/RRefBase; otherwise we
would need to let the pyRemote* C++ functions load RRef from Python
and return an RRef instance. It is possible to let RRef hold an
instance of PyRRef instead of inheriting from it, but this does not
look like an elegant design, as we would have RRef holding PyRRef and
PyRRef holding the C++ RRef. Another alternative is to use dynamic
method loading, by installing member methods on PyRRef instances.
However, this would require different solutions to handle
RRef(data) and rpc.remote(...). Based on the above, we
decided to go with the current implementation for simplicity, and we
can also keep all RRef-related APIs in one place.

Test Plan: Imported from OSS

Differential Revision: D21028333

Pulled By: mrshenli

fbshipit-source-id: fe90f56ef7183d18874e357900093755e1601eb4
2020-04-21 19:29:54 -07:00
b982a6a247 Expose torch.distributed.is_available() API (#37021)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37021

Test Plan: Imported from OSS

Differential Revision: D21164318

Pulled By: mrshenli

fbshipit-source-id: 08a446af342cbe54f3eb4994956ffa7ef4922bcf
2020-04-21 18:38:46 -07:00
25abdcb3d1 [TensorExpr] add Block flattening to IR Simplifier (#37013)
Summary:
Some IR optimizations were leaving superfluous Blocks in the IR. This PR adds simplification and merging of enclosing Block statements to the IR Simplifier, e.g.

```
Block {
   Stmt 1
   Block {
       Stmt 2
   }
   Block {}
}
```

becomes

```
Block {
   Stmt 1
   Stmt 2
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37013

Differential Revision: D21166208

Pulled By: nickgg

fbshipit-source-id: 6dcdf863980d94731a8ddf184882c4a5b7259381
2020-04-21 17:58:18 -07:00
a850d8a526 Fixes exponential with lambda=0 (#36837)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36798.

In the future more thorough testing would be nice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36837

Differential Revision: D21102342

Pulled By: mruberry

fbshipit-source-id: 4fae45677e54b403296033720dfb13abca47f3a4
2020-04-21 17:34:07 -07:00
dc327d9082 [TensorExpr] Remove obsolete code for handling dynamic shapes from kernel.cpp. (#36686)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36686

Test Plan: Imported from OSS

Differential Revision: D21053305

Pulled By: ZolotukhinM

fbshipit-source-id: eb6111df8aead1fd881749141b87a1395285eb0e
2020-04-21 17:29:12 -07:00
359e7f4bba Teach IRParser to parse strides along with sizes in a tensor type. (#36951)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36951

Test Plan: Imported from OSS

Differential Revision: D21139940

Pulled By: ZolotukhinM

fbshipit-source-id: b56a1fddfc9de4684da3ba9a462e344d0985e8b6
2020-04-21 17:27:15 -07:00
8eb22f6ee9 Revert D21161361: [pytorch][PR] Revert "Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API"
Test Plan: revert-hammer

Differential Revision:
D21161361

Original commit changeset: dca4192d3f7b

fbshipit-source-id: 311b6ab6169feb5e12ae8c959789f8f0acd5205d
2020-04-21 17:13:05 -07:00
443fe7ca0e [rpc] Avoid wireDeserializer overreading buffers by 1 byte (#36976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36976

The bounds check and the read were swapped in two places; I noticed
ASAN complaining about an erroneous buffer in an unrelated change.

Adding a couple of simple test cases.
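For illustration only, a minimal Python sketch of the bug class (the actual fix is in the C++ wire deserializer):
```python
def read_byte(buf: bytes, pos: int) -> int:
    # the fix: bounds check *before* the read; the buggy version read first,
    # overreading the buffer by one byte on the last iteration
    if pos >= len(buf):
        raise ValueError("wire format truncated")
    return buf[pos]
```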
ghstack-source-id: 102606986

Test Plan: buck test mode/dev caffe2/test/cpp/rpc:

Differential Revision: D21148936

fbshipit-source-id: 7ec5007535f7310437ac1b9a72852a223b9dd29a
2020-04-21 17:01:45 -07:00
b019a8d484 fix spatialbatchnorm on nnpi (#36987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36987

The discrepancy comes from using Eigen's sqrt.
Replacing it with std::sqrt worked, so we are using MKL's version.
Also removed momentum, made epsilon a float, and enhanced the test with Hypothesis.

Test Plan: testing the MKL dependencies in prod; if things work, we will remove the intrinsics implementation, if not, we will use the intrinsics

Reviewed By: yinghai

Differential Revision: D21151661

fbshipit-source-id: 56e617b13bc32b0020691f7201d16dee00f651b5
2020-04-21 16:52:13 -07:00
f0a533c5dd Fix flaky test_backward_node_failure_python_udf (#36969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36969

`test_backward_node_failure_python_udf` was flaky since it used the
RPC framework to indicate rank 0 was done with processing. Since we kill nodes
in this unit test, it is very likely that listenLoop() has exited on some nodes
and hence using an RPC to inform all nodes about rank 0's completion
might not work, since the RPC might not be processed on certain nodes.

To fix this, we use the c10d store instead for this notification.
ghstack-source-id: 102549873

Test Plan: waitforbuildbot

Differential Revision: D21147099

fbshipit-source-id: 745273a6cae0debbae131bb4cc7debe9c201bf98
2020-04-21 16:42:47 -07:00
1592d6842c [resubmit] Move profiler to a dispatch wrapper (#36766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36766

Original commit changeset: dcb41d243369
ghstack-source-id: 102614215

Test Plan: waitforsadcastle

Differential Revision: D21076029

fbshipit-source-id: c2461c57cfd364bd23ff99bc2cb5572d22e23391
2020-04-21 16:37:11 -07:00
bcdb0727c2 Revert D20907254: Fix long line splitting issue in python_print
Test Plan: revert-hammer

Differential Revision:
D20907254

Original commit changeset: ebfc1a4eefc2

fbshipit-source-id: 76440a8649a17728c50e2f3eeb3744a2245f6daf
2020-04-21 16:24:32 -07:00
a92f1dc85e native_functions.yaml: reset_grad_accumulator (#36431)
Summary:
Add optional 'reset_grad_accumulator' in native_functions.yaml
Use it for all overloads of `set_`

Fixes https://github.com/pytorch/pytorch/issues/33941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36431

Differential Revision: D21156062

Pulled By: ezyang

fbshipit-source-id: 5a63fd091a618a33cb05fd96bbb1e87162abc9a4
2020-04-21 16:19:15 -07:00
806f22b167 find backtrace by cmake module (#36017)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36017

Reviewed By: zou3519, albanD

Differential Revision: D20870233

Pulled By: ezyang

fbshipit-source-id: b11daf22a900e47b5b72272fb2f096d78c075bf8
2020-04-21 16:00:33 -07:00
6ebfff6c4e Add locks to fallback register/deregister. (#36628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36628

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21029702

Pulled By: ezyang

fbshipit-source-id: 2322094338ad896653b2db43ff74a8ab1593b3e1
2020-04-21 15:55:20 -07:00
bf676682e7 Fix long line splitting issue in python_print (#36188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36188

* Need to remove the O(n^2) behavior when scanning whether to split or not,
  otherwise long inline chains will take a long time re-scanning.

Test Plan: Imported from OSS

Differential Revision: D20907254

Pulled By: zdevito

fbshipit-source-id: ebfc1a4eefc26d5806381e7afd75b7a9cd4cde97
2020-04-21 15:46:42 -07:00
e1742e8e4e Revert "Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API" (#37019)
Summary:
This reverts commit 2ccdc39dce91a1821ede9bfdea26b30d66e1554f.

Original PR: https://github.com/pytorch/pytorch/pull/36742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37019

Differential Revision: D21161361

Pulled By: ezyang

fbshipit-source-id: dca4192d3f7be25a34bbe3d57ddce3afc1c2558c
2020-04-21 15:39:39 -07:00
6eb109e1ad Enable float only requantization. Part 1. (#35856)
Summary:
This PR is motivated by two issues it tries to address:
1) relax the constraint on requantization scale (<1).
2) Unify requantization methodology across pytorch integration of QNNPACK and FBGEMM.

Here we are trying to address the first part for Conv and Linear.
The existing requantization scheme performs the scale multiplication entirely in integer arithmetic by extracting the mantissa and exponent of the FP scale and processing them, including the appropriate rounding. The instruction sequence for this is specifically tailored to the condition scale < 1.

Relaxing this constraint requires fixing that instruction sequence. In this PR we take a simpler approach: convert Int32 to FP32, apply the scale, and convert FP32 back to Int32 with appropriate rounding (to-nearest-ties-to-even). This is followed by zero-point addition and clamping. Since a nearest-ties-to-even rounding instruction is not available on 32-bit ARM, the sequence there is slightly different. The sequences for both 32-bit and 64-bit are taken from https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qnnpack/src/requantization/fp32-neon.c.
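A minimal NumPy sketch of the FP32 requantization path described above (the uint8 output range is an assumption for illustration):
```python
import numpy as np

def requantize_fp32(acc, scale, zero_point):
    # int32 accumulator -> fp32, apply the (no longer constrained) scale
    scaled = acc.astype(np.float32) * np.float32(scale)
    # round-to-nearest-ties-to-even, add the zero point, clamp to uint8
    rounded = np.rint(scaled).astype(np.int32)
    return np.clip(rounded + zero_point, 0, 255).astype(np.uint8)

print(requantize_fp32(np.array([100, -100], dtype=np.int32), 1.5, 128))
# [255   0] after clamping
```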

Furthermore, relaxing the scale constraint and moving towards FP requantization also helps us move towards unifying the requantization procedure across QNNPACK and FBGEMM.

Summary of the PR:
- requantization params are modified to lift some computation that would have to be in the kernel otherwise for aarch32 kernels, particularly:
   - Computing vfmin, vfmax, vfmagic and vimagic.
- Fixed q8gemm, q8conv and q8dwconv kernels.
- Fixed the corresponding tests.

What is not done:
- XZP kernels are not changed as part of this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35856

Differential Revision: D20996325

Pulled By: kimishpatel

fbshipit-source-id: 7a7a18b09dd2564768142371db06d98bf7479f49
2020-04-21 15:20:20 -07:00
1f82679311 Revert D21156042: [pytorch][PR] CMake/Ninja: fix dependencies for .cu files
Test Plan: revert-hammer

Differential Revision:
D21156042

Original commit changeset: fda3aaa57207

fbshipit-source-id: 59b208d4dc7ab743876af3ed382477770526aa1a
2020-04-21 14:24:27 -07:00
4e463b6366 add missing ops for portal TTS model (again) (#37007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37007

D20961463 was reverted due to clang-format. Redo it.

Test Plan: verified TTS model can be loaded without problem

Reviewed By: iseeyuan

Differential Revision: D21157626

fbshipit-source-id: 372bf6196da20b3ebafa283c5c3f7c924a37ed60
2020-04-21 14:12:01 -07:00
b607c83a26 Add support for bool/byte attn_mask tensor in MultiheadAttention/Transformer modules (#33763)
Summary:
Add support for accepting float, byte, and bool tensors for `attn_mask`. No breakage is expected.

- If a bool tensor is provided, positions with `True` are not allowed to attend while `False` values will be unchanged.
- if a byte tensor is provided, it will be converted to bool tensor. Positions with non-zero are not allowed to attend while zero values will be unchanged.
- If a float tensor is provided, it will be added to the attention weight.

Note: the behavior of the float mask tensor is slightly different from the first two options because it is added to the attention weight, rather than calling `masked_fill_` function. Also, converting a byte tensor to bool tensor within `multi_head_attention_forward` causes extra overhead. Therefore, a bool mask is recommended here.

For `key_padding_mask`:
- If a bool tensor is provided, the positions with the value of `True` will be ignored while the positions with the value of `False` will be unchanged.
- If a byte tensor is provided, it will be converted to a bool tensor; the positions with a non-zero value will be ignored while the positions with the value of zero will be unchanged (see the usage sketch below).
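A short usage sketch of the two `attn_mask` flavors (shapes are illustrative):
```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
q = k = v = torch.randn(5, 1, 8)  # (seq_len, batch, embed_dim)

# bool mask: True positions are not allowed to attend
bool_mask = torch.zeros(5, 5, dtype=torch.bool)
bool_mask[0, 1] = True
out, _ = mha(q, k, v, attn_mask=bool_mask)

# float mask: added to the attention weights; -inf masks a position out
float_mask = torch.zeros(5, 5)
float_mask[0, 1] = float("-inf")
out, _ = mha(q, k, v, attn_mask=float_mask)
```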
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33763

Differential Revision: D20925358

Pulled By: zhangguanheng66

fbshipit-source-id: de174056be183cdad0f3de8024ee0a3c5eb364c9
2020-04-21 14:06:59 -07:00
9854df673c [TensorExpr] Fix bug in For elimination in the IRSimplifier. (#36965)
Summary:
When eliminating For loops which execute once, e.g. `for i = 0; i < 1; ++i { thing; } => thing;`, we did variable substitution while the temporary simplifier ExprNodes still existed, which could put them in an invalid state and leave unsimplified terms in the expression. The fix is to apply the substitution before simplifying the body of the for loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36965

Differential Revision: D21145248

Pulled By: nickgg

fbshipit-source-id: d874600c7a098fc05b8ef3109e516e2eaa2c24e0
2020-04-21 13:38:09 -07:00
6d13a334f6 Remove use_c10_dispatcher: unboxed_only (#36838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36838

All ops now do unboxing after dispatch, i.e. c10 handles unboxing and c10 registers a wrapper for the op to JIT
The last op that manually registered its own wrapper to JIT in register_aten_ops.cpp was migrated.

Since there are no ops using use_c10_dispatcher: unboxed_only anymore, we can delete the feature.

Also:
- Rename some files to more accurately describe what they're doing now:
  - OpsAlreadyMovedToC10.h/cpp -> ATenOpList.h/cpp
  - register_aten_ops.cpp -> generated_unboxing_wrappers.cpp
  - gen_jit_dispatch.py -> gen_unboxing_wrappers.py
ghstack-source-id: 102532915

Test Plan: waitforsandcastle

Differential Revision: D21100081

fbshipit-source-id: be824958eef33f6cd42a6a652175bd0b1df4ebf9
2020-04-21 13:32:33 -07:00
ea97fa1f2a [PyTorch][Dist] Trigger pre/post hooks of output function nodes under distributed autograd (#34501)
Summary:
# Goals
Do the following things during a distributed backward pass.
1. Accumulate the gradient of a variable to RPC context once the gradient is ready instead of at the very end of the backward pass.
2. Run post/pre hooks installed in`AccumulateGrad` nodes once the gradient is ready for the variable. Currently, the hooks in `AccumulateGrad` are not executed just because the function `AccumulateGrad` itself is not even evaluated by the local engine.
3. Make it extensible to support post hooks installed by DDP's reducer.

# Introduce GradCapturePreHook

## Why do we need this?

### Root issue:

* dist engine uses the autograd.grad-like API on the vanilla engine and then in the Future callback populates the context with the gradients. This is a bad emulation of the .backward() call on the vanilla engine.

### Practical issue:

* The leaf’s hooks are not called (because they are associated with the AccumulateGrad nodes, which are not called in the autograd.grad-like API). Modules like DDP rely on these hooks.
* The Future is marked as completed before the context is actually populated with the grads, leading to unexpected behavior on the user side.
* The Future callback is only called at the complete end of the backward pass, which is too late for DDP if it wants to overlap compute/transfer.

### Proposed solution:

* Provide hooks in the autograd.grad-like API that will allow the distributed engine to populate the context and call the hooks to better emulate the .backward call.

## Who can install a grad capture pre-hook?

This will be an internal hook at the C++ level, and it won’t be exposed to Python code. Only call sites directly interacting with the local engine can install such hooks.

## Signature
The returned `grad` will be captured.
```
virtual const torch::Tensor& operator()(const torch::Tensor& grads) = 0;
```

## Where are hooks installed?

Grad capture pre-hooks are installed in GraphTask::ExecInfo::Capture. ExecInfo is per-node. Every backward run will have its own GraphTask instance.

## When/How will hooks be called?

When the local engine captures the grads for a node, all grad capture pre-hooks are called one by one in the order they were added. The output grads of each hook replace the original grads.
The output of the last hook will be used for grad capturing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34501

Test Plan:
All existing tests should pass.

```

python setup.py develop

python test/distributed/rpc/test_dist_autograd_spawn.py DistAutogradTestWithSpawn.test_post_hooks

```

Differential Revision: D20953673

Pulled By: hczhu

fbshipit-source-id: 543b3844823330ea9f9856bab7c5cb2679290a53
2020-04-21 13:23:18 -07:00
97d3a8495d [reland][quant] QuantizedCUDA implementation (#36936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36936

Closes https://github.com/pytorch/pytorch/issues/30813

Relanding of https://github.com/pytorch/pytorch/pull/35463

1. Tensor quantization logic (quantize_*) is moved to aten/native/quantized. Previously all logic for tensor quantization lived in the aten/quantized/Quantizer.cpp file, which had become complicated and hard to read. This problem should be addressed in a refactoring PR. Still, I reworked this partially because I had to add tensor quantization logic for CUDA, and it was natural to move everything to aten/native/quantized.
2. The requirements to run CUDA_tensor_apply* were eased to process any tensor that lives on the CUDA device (QuantizedCUDA included).
3. All quantized data types now have a default constructor. NVCC refuses to compile any gpu_kernel or CUDA_tensor_apply* without them.
4. Minor changes in many files to register QuantizedCUDA backend.
5. test_quantized_tensor is extended to process QuantizedCUDA backend where possible.

Test Plan: Imported from OSS

Differential Revision: D21143025

Pulled By: jerryzh168

fbshipit-source-id: 11405e2e8f87e48fadc0a084c51db15f85ccb500
2020-04-21 13:18:52 -07:00
4efef475d7 [WIP] make test_distributed gloo test use MultiProcessTestCase (#36970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36970

We would like to move all distributed testing to use the existing
multiprocessing tooling defined in common_distributed.py. With this change, we
make `TestDistBackend` inherit from `MultiProcessTestCase` and enable fork mode
multiprocessing. In the next step, we can enable spawn mode for these tests
which will give us TSAN coverage.
ghstack-source-id: 102553801

Test Plan: Unittests

Differential Revision: D21146947

fbshipit-source-id: 608fa2cb93e88f8de6a5ac87c523e2c4e4fede1b
2020-04-21 13:13:48 -07:00
6383373a04 [quant][graphmode] fused conv3d + relu (#36885)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36885

Test Plan: Imported from OSS

Differential Revision: D21110359

Pulled By: z-a-f

fbshipit-source-id: d2590c5af13cdf5c843e68529ced8e3ce72d256e
2020-04-21 13:05:27 -07:00
2ccdc39dce Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API
Test Plan: revert-hammer

Differential Revision:
D21089648

Original commit changeset: 8d54329c1252

fbshipit-source-id: 636e8a11afc628a4cdae9d44824985c10c70555e
2020-04-21 12:21:45 -07:00
a05406ea56 [clang-format] Disable progress bar if stdout is piped (#36955)
Summary:
**Summary**
This commit disables the progress bar for the `clang-format` binary
download if stdout is not attached to a terminal. The cursor
repositioning tricks used to print out the progress bar don't work if
stdout is redirected to something that is not a terminal, and so the file
ends up containing each progress bar update on a separate line. This
happens in the GitHub workflow for checking formatting and is annoying
to scroll through.
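The gist of the check, as a hedged sketch rather than the actual tools/clang_format.py code:
```python
import sys

def report_progress(pct: int) -> None:
    # cursor-repositioning tricks only work on a real terminal,
    # so stay quiet when stdout is piped (e.g. in the GitHub workflow)
    if sys.stdout.isatty():
        sys.stdout.write(f"\r{pct:3d}%")
        sys.stdout.flush()
```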

**Test Plan**
1. Manual invocation of the script still produces progress bar.
```
(pytorch) me@devgpuXXX:pytorch  (disable-cf-progress-bar)$ with-proxy tools/clang_format.py
Downloading clang-format to /home/me/local/repos/pytorch/.clang-format-bin
0% |################################################################| 100%
Using clang-format located at /home/me/local/repos/pytorch/.clang-format-bin/clang-format
```
2. GitHub `clang-format` workflow output no longer contains progress bar.
```
Run set -eux
+ echo '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ echo '| Run tools/clang_format.py to fix formatting errors |'
| Run tools/clang_format.py to fix formatting errors |
+ echo '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ tools/clang_format.py --verbose --diff
Created directory /home/runner/work/pytorch/pytorch/.clang-format-bin for clang-format binary
Downloading clang-format to /home/runner/work/pytorch/pytorch/.clang-format-bin

Reference Hash: d1365110da598d148d8143a7f2ccfd8bac7df499
Actual Hash: d1365110da598d148d8143a7f2ccfd8bac7df499
Using clang-format located at /home/runner/work/pytorch/pytorch/.clang-format-bin/clang-format
All files formatted correctly
```
**Fixes**
This PR fixes https://github.com/pytorch/pytorch/issues/36949.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36955

Differential Revision: D21157861

Pulled By: SplitInfinity

fbshipit-source-id: 16c6d4395cee09f3bd2abac13e9be4acdde73406
2020-04-21 11:28:33 -07:00
71ec8b2002 Switches test_jit to use float32 as its default scalar type (#36982)
Summary:
Our test suite used to set double as its default scalar type, and when it was switched to not do so (to be more consistent with how users experience PyTorch), a few tests still had to set the default scalar type to double to function properly. Now that the jit no longer creates double tensors so frequently, it appears that test_jit no longer needs to set double as its default scalar type either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36982

Differential Revision: D21152120

Pulled By: mruberry

fbshipit-source-id: ea6d3c1ad55552dc5affa1fe1bd0e5189849e6d7
2020-04-21 11:23:28 -07:00
54f265249c Optimize grouped Conv3d performance (#36355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36355

Resolving issue in https://github.com/pytorch/pytorch/issues/36155, by:
- supporting grouped conv3d in ```slow_conv3d```
- adding a fast path in ```__convolution``` to call ```slow_conv3d``` when
  running grouped conv3d on CPU
- bypassing unfolding when kernel_size = 1
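A minimal usage sketch that exercises the optimized paths (sizes are illustrative):
```python
import torch
import torch.nn as nn

# groups=2 hits the grouped CPU path; kernel_size=1 hits the unfold-free path
conv = nn.Conv3d(in_channels=4, out_channels=8, kernel_size=1, groups=2)
x = torch.randn(1, 4, 6, 6, 6)
y = conv(x)         # forward
y.sum().backward()  # backward, also covered by the new tests
```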

Test Plan:
Added the following test cases in test_nn.py, testing both forward and
backward:
- test_Conv3d_groups_nobias
- test_Conv3d_groups_wbias
- test_Conv_1x1

Imported from OSS

Differential Revision: D20957073

fbshipit-source-id: 29afd1e6be8c484859eaedd51463954e2fdccc38
2020-04-21 11:17:07 -07:00
00b7d84eb7 Add a .with_cache() method to distributions.Transform objects (#36882)
Summary:
This resolves an issue observed by stefanwebb where the composition of multiple transforms is cached only if all components are cached.

This PR adds a new method `.with_cache()` so that e.g. you can compose a normalizing flow (that needs to be cached) with a `SigmoidTransform` (that wasn't already cached) by calling `.with_cache()` on the latter. This issue also comes up when composing non-cached constraint transforms as returned by `transform_to()` and `biject_to()`: after this PR you can call `transform_to(constraints.positive).with_cache()` to get a cached `ExpTransform`.
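A minimal sketch of the second use case:
```python
import torch
from torch.distributions import constraints, transform_to

# transform_to(constraints.positive) returns an uncached ExpTransform;
# .with_cache() returns a caching copy of it
t = transform_to(constraints.positive).with_cache()
x = torch.randn(3)
y = t(x)
assert t.inv(y) is x  # the cached inverse returns the original tensor
```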

## Tested
- [x] added a unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36882

Differential Revision: D21155914

Pulled By: ezyang

fbshipit-source-id: 3c06e63785ca2503e08a5cd7532aff81882835e9
2020-04-21 10:50:31 -07:00
01100cb477 Put TORCH_LIBRARY in torch/library.h; add custom class API (#36742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36742

Now, you can define a custom class inside a TORCH_LIBRARY block.
It looks very similar to what you did before.  Instead of

```
static auto m = torch::class_<Class>("Namespace", "Class").def("foo", foo);
```

you write

```
TORCH_LIBRARY(Namespace, m) {
  m.class_<Class>("Class")
    .def("foo", foo);
}
```

All the old usages still work, but at some point we should start
updating the tutorials when we're ready to go 100% live with the
new pybind11 style API.

The custom class API previously lived in the torch/ folder and in the torch
namespace, so for consistency, the new TORCH_LIBRARY also got
moved to torch/library.h. The definition of Library::class_ is at the
bottom of that header because I need all of the class_ constructors
available, but there is a circular dependency between the two headers.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D21089648

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 8d54329c125242605336c22fa1642aae6940b507
2020-04-21 10:05:21 -07:00
db84689c09 CMake/Ninja: fix dependencies for .cu files (#36938)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26304

After this patch `build.ninja` entries for `.cu` files will contain a `depfile` variable pointing to a `.NVCC-depend` file containing dependencies (i.e., header files included directly or indirectly) of the `.cu` source file. Until now, those `.NVCC-depend` files were being transposed into `.cu.o.depend` files in CMake format. That did not work as intended because the `.cu.o` target file was declared to be dependent on the `.cu.o.depend` file itself, rather than its contents. In fact, Ninja lacks the functionality to process dependencies in the CMake format of those `.cu.o.depend` files.

This was tested on Linux as described in https://github.com/pytorch/pytorch/issues/26304#issuecomment-614667170
I have also verified that the original problem does not reproduce with Makefiles (i.e., when `ninja` is not present in the system) and that PyTorch still build successfully with Makefiles after this patch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36938

Differential Revision: D21156042

Pulled By: ezyang

fbshipit-source-id: fda3aaa57207f4d6bf74d2f254fe45fb7fd90eec
2020-04-21 09:43:48 -07:00
246b208e4f make merge_fp32_into_fp16_inputs to generate ops for each partition (#36973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36973

Handle the case where inputs are used in multiple partitions.

Test Plan: unit tests

Reviewed By: yinghai

Differential Revision: D21107672

fbshipit-source-id: 9eca20220b80f27400aefcdaeff5d5503e32654c
2020-04-21 09:27:36 -07:00
be52b7f0ea Documentation LU Decomposition: deriving L, U, and P (#36907)
Summary:
Add note to LU decomposition to use `lu_unpack` to get `L`, `U`, and `P`.
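A minimal sketch of what the note describes:
```python
import torch

A = torch.randn(3, 3)
A_LU, pivots = torch.lu(A)
P, L, U = torch.lu_unpack(A_LU, pivots)  # recover the individual factors
assert torch.allclose(P @ L @ U, A, atol=1e-6)
```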

Fixes https://github.com/pytorch/pytorch/issues/36752.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36907

Differential Revision: D21134545

Pulled By: albanD

fbshipit-source-id: 54d4872bb8c95dfb8048aedace9781f843ab8a30
2020-04-21 07:40:21 -07:00
ff435a0e6b [pytorch] add test for empty tensor support in nn.Linear (#36983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36983

fix https://github.com/pytorch/pytorch/issues/34202

It seems to be fixed now, but there was no test for it; this adds one.

Test Plan: sandcastle

Differential Revision: D21149623

fbshipit-source-id: 109f8e75a0826541ec7beb1920d5a38e0e826899
2020-04-21 01:15:26 -07:00
0f3af8529a Revert D20961463: [TEST] add ops for portal TTS model
Test Plan: revert-hammer

Differential Revision:
D20961463

Original commit changeset: 5022077caccd

fbshipit-source-id: 1f567020e94ac151ed7568c13b5c3cd61226d309
2020-04-21 01:04:04 -07:00
98c293c1ef Do not use VLAs in vec256_qint.h (#36855)
Summary:
Use `std::decay_t<decltype(foo)>::size()` instead of `foo.size()` to help the compiler with static array allocations.
Even though `Vec256::size()` is `constexpr`, `foo.size()` (where `foo` is of type `Vec256`) is not an integral constant expression, so the compiler has to use VLAs, which are not part of the C++ standard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36855

Test Plan: CI

Differential Revision: D21151194

Pulled By: malfet

fbshipit-source-id: eaf3e467c7f7ee6798ca82fe9f8fae725011ead0
2020-04-21 00:59:23 -07:00
68f6b9873b [TEST] add ops for portal TTS model (#36971)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36971

Add lite interpreter ops for portal TTS model.

Test Plan:
Convert to lite interpreter model:
buck run //xplat/caffe2/fb/pytorch_predictor:converter <FULL_JIT_MODEL> <LITE_MODEL>

Load model using benchmark program (on devserver)
buck run //xplat/caffe2/fb/lite_predictor:lite_predictor_tts -- --model <MODEL>

(Expect benchmark to fail because inputs are invalid)

Reviewed By: iseeyuan

Differential Revision: D20961463

fbshipit-source-id: 5022077caccd8c07666789bbbca68c643129ee0a
2020-04-21 00:35:50 -07:00
3d2d5c82da Clean-up non-AVX variant of bitwise_binary_op template (#36966)
Summary:
Compute the number of elements as a `constexpr` and use it both as the `buffer` element count and as the upper loop bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36966

Differential Revision: D21150602

Pulled By: malfet

fbshipit-source-id: 581634565c54c7295f3b77c8dc86659d5cc4ce19
2020-04-21 00:29:04 -07:00
a1eb591ea6 fmadd in vec256_base should be on Vec256<T>, not T (#36751)
Summary:
This is meant to be the general version of the fmadd specializations in
vec256_double and vec256_float, so it should operate on Vec256<T>, not T.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36751

Differential Revision: D21148849

Pulled By: pbelevich

fbshipit-source-id: 0805075d81c61d22383a3055aebcb91d09e545de
2020-04-20 21:30:57 -07:00
cdc1ca040a Enable test_hardsigmoid_grad_xla on pytorch side (#36967)
Summary:
hardsigmoid_backward is implemented on the xla side, so the test will not error out, but it is really slow due to a lot of recompilation. Enable the test on the pytorch side but skip it on the xla side, so that xla can control when to enable the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36967

Differential Revision: D21149113

Pulled By: ailzhang

fbshipit-source-id: fc337622fafa7be9cff2631de131980ea53adb8d
2020-04-20 21:21:59 -07:00
742d9796bc [ROCm] Enable wrongly skipped tests on CPU on ROCm (#36968)
Summary:
`skipIfRocm` skips the test on ROCm regardless of device type [CPU or GPU]. `skipCUDAIfRocm` skips only on GPU on ROCm and runs the test on CPU.

ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36968

Differential Revision: D21149721

Pulled By: ezyang

fbshipit-source-id: 361811b0b307f17193ad72ee8bcc7f2c65ce6203
2020-04-20 21:15:58 -07:00
ce0500eb4c Ensure linearIndex of advanced indexing backwards is contiguous. (#36959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36959

This is a more straightforward solution to the problem than https://github.com/pytorch/pytorch/pull/36957; I don't know about the relative performance.

Fixes: #36956

Test Plan: Imported from OSS

Differential Revision: D21144146

Pulled By: gchanan

fbshipit-source-id: a10ab905219a73157d5d7183492b52d7c8dd6072
2020-04-20 20:52:55 -07:00
59f923e884 Update NNPI backend to 0.5.1.8 (#4397)
Summary:
Update of NNPI Backend to v0.5.1.8.
Pull Request resolved: https://github.com/pytorch/glow/pull/4397

Reviewed By: jfix71

Differential Revision: D20938791

Pulled By: arunm-git

fbshipit-source-id: 4f50d104db1ac274e92d9fce6fb86afd930d8511
2020-04-20 20:34:20 -07:00
346215caa4 [jit] Adding vectorized load/store support for JIT generated CUDA kernel (#36555)
Summary:
The JIT pointwise kernel currently does not do vectorized loads/stores, which may lead to suboptimal performance for shorter data types like half and int8.

In this PR, a fixed length of 4 elements per load/store is added for supported tensor shapes, implemented as a runtime check inside the kernel.

Supported tensor shapes:
- all input/output data pointers are aligned to 4*sizeof(dtype)
- last dimension is contiguous (stride 1) and its size is a multiple of 4
- all other dimensions have strides that are multiples of 4

All test_jit* passed, and here is performance result on a simple `ax+by+c` fusion
result before PR:
```
torch.float32 kernel time: 0.748 ms.
torch.float16 kernel time: 0.423 ms.
torch.int8 kernel time: 0.268 ms.
```
result after PR:
```
torch.float32 kernel time: 0.733 ms.
torch.float16 kernel time: 0.363 ms.
torch.int8 kernel time: 0.191 ms.
```
test code:
```
import torch
import time

# disable profiling to test all data types
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)

torch.jit.script
def axpby(x, y):
    return  x * 2 - y * 3 + 1

for test_dtype in [torch.float32, torch.float16, torch.int8]:
    a = torch.randn(12345,4096, device="cuda").to(test_dtype)
    b = torch.randn(12345,4096, device="cuda").to(test_dtype)

    # warm up
    for _ in range(100):
        c = axpby(a,b)

    torch.cuda.synchronize()
    start = time.time()

    for _ in range(1000):
        c = axpby(a,b)

    torch.cuda.synchronize()
    end = time.time()
    print("{} kernel time: {:.3f} ms.".format(test_dtype, end-start))
```
Generated code:
[log_with_generated_code.txt](https://github.com/pytorch/pytorch/files/4472813/log_with_generated_code.txt)

Additional note:
the double type is excluded from the vectorized code path.

We can later improve this with dynamic vectorization-length support and fewer in-kernel checks once we can use tensor shape information in codegen. For now, this implementation follows the existing caching through the TensorDesc mechanism, which does not carry enough compile-time information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36555

Differential Revision: D21142762

Pulled By: ngimel

fbshipit-source-id: 1cfdc5807a944c4670b040dc2d2dfa480377e7d7
2020-04-20 19:24:28 -07:00
3ae70cb847 Add RecordFunctionGuard (#36215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36215

Make it possible to disable observers, e.g. to avoid
infinite recursion if an observer uses an operator

Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit

Differential Revision: D20912676

Pulled By: ilia-cher

fbshipit-source-id: 29760cdfe488a02f943f755967b78779d6dbcef3
2020-04-20 19:19:14 -07:00
a14a8376aa Link NCCL lib to TORCH_PYTHON_LINK_LIBRARIES when USE_NCCL=1 (#36948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36948

Compiling with USE_DISTRIBUTED=0 fails as it would still try to
compile python_nccl.cpp which requires NCCL but the NCCL lib is not
linked.

Test Plan: Imported from OSS

Differential Revision: D21142012

Pulled By: mrshenli

fbshipit-source-id: 6ca94056ca859da7f833a31edcb4c5260d8625e4
2020-04-20 19:15:02 -07:00
32307efd68 Fix flaky test_barrier_timeout* tests for test_distributed. (#36963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36963

A couple of reasons why these tests were flaky:
1) Sometimes the error message for timeout would include lowercase 'timeout'
which was not included in the regex.
2) The timeout was 0.2 seconds, which was probably too small for ASAN/TSAN.
ghstack-source-id: 102541231

Test Plan: waitforbuildbot

Differential Revision: D21144954

fbshipit-source-id: 57945f53e1627028835cbfd2adb72f21d87f593f
2020-04-20 18:58:16 -07:00
5e504e83e8 Add sync-point insertions and block/thread local memory allocations (#36563)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36563

Test Plan: Imported from OSS

Differential Revision: D21014238

Pulled By: zheng-xq

fbshipit-source-id: 4d61ff2f76345ea2825f2d5f60a771f65b24ad69
2020-04-20 18:52:30 -07:00
0647f34477 Delete docker build job for pytorch-linux-bionic-clang9-thrift-llvmdev (#36930)
Summary:
Seems like no one is using this image. We could delete it from our docker hub.
I think we don't need to regenerate a new set of images, since we are only deleting. But please correct me if I'm wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36930

Reviewed By: malfet

Differential Revision: D21138079

Pulled By: ailzhang

fbshipit-source-id: 4a563e6310b193cb885411bcd925296b01223368
2020-04-20 17:58:36 -07:00
c03d149483 [quant][graph] Add quantizedmul_relu and quantized::mul_scalar_relu ops (#36844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36844

Test Plan:
python test/quantization/test_quantize_script.py TestQuantizeScriptPTSQOps.test_quantized_mul_relu
python test/quantization/test_quantize_script.py TestQuantizeScriptPTSQOps.test_quantized_mul_scalar_relu

Imported from OSS

Differential Revision: D21134440

fbshipit-source-id: 483fb30066cebf659b2f3be22c18d389c7972f81
2020-04-20 15:42:06 -07:00
ee2a9ac56e [quant][graph] Support for quantized::mul and quantized::mul_scalar (#36818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36818

Test Plan:
python test_quantize_script.py test_quantized_mul
python test_quantize_script.py test_quantized_mul_scalar

Imported from OSS

Differential Revision: D21134438

fbshipit-source-id: 9ed5e852c5c0c6899a11e3ed36e12b5045608ea4
2020-04-20 15:40:32 -07:00
1c15cb4773 Add bundled input support to speed_benchmark_torch (#36765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36765

We recently added support for bundling inputs with models.  Now add
support to the benchmarker to use those inputs.  This frees users from
having to look up the proper input format for each model.

Test Plan:
- Ran on a model without bundled inputs.  Saw a clear error.
- Ran on a model with too few bundled inputs.  Saw a clear error.
- Ran on a proper bundled input.  Model executed.

Differential Revision: D21142659

Pulled By: dreiss

fbshipit-source-id: d23c1eb9d1de882345b007bf2bfbbbd6f964f6fe
2020-04-20 15:32:57 -07:00
dbdd0f50f4 [quant] Minor refactor in fused conv names (#36883)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36883

Test Plan: Imported from OSS

Differential Revision: D21110360

Pulled By: z-a-f

fbshipit-source-id: f7268a8004432254aa54525854bed059a1a6a350
2020-04-20 15:24:39 -07:00
28f439d4f4 add absolute alias for abs (#36597)
Summary:
Adds an absolute alias for the abs function to match Numpy's use of both:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.absolute.html

Adds a test to ensure the outputs from abs and absolute are the same.
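A quick sketch of the alias in use:
```python
import torch

x = torch.tensor([-1.5, 0.0, 2.5])
# torch.absolute is a NumPy-style alias for torch.abs
assert torch.equal(torch.absolute(x), torch.abs(x))
```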
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36597

Differential Revision: D21024458

Pulled By: jessebrizzi

fbshipit-source-id: 4f2987e7bc7cde444d0a93e833a0350844b48d44
2020-04-20 14:49:51 -07:00
4d171c0ed9 hardsigmoid: add PyTorch wrapper for the QNNPACK path (#36699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36699

Hooks up the QNNPACK op from the previous PR to work in the
PyTorch layers.

Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_qhardsigmoid
python test/quantization/test_quantized.py TestQNNPackOps.test_qhardsigmoid
```

Imported from OSS

Differential Revision: D21057152

fbshipit-source-id: 5f2094d1db80575f7f65497f553ca329f7518175
2020-04-20 14:20:28 -07:00
1d720228d2 hardsigmoid operator for QNNPACK, using LUTs (#36698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36698

Adds the hardsigmoid op to QNNPACK using the LUT kernel

Test Plan:
```
cd aten/src/ATen/native/quantized/cpu/qnnpack
with-proxy ./scripts/build-local.sh
./build/local/hardsigmoid-test
```

Imported from OSS

Differential Revision: D21057153

fbshipit-source-id: 31ce09643959b159a82e7083fc11e1e5e98c49ce
2020-04-20 14:18:59 -07:00
97f2513c26 Canonicalize includes in aten, and add tests for it (#36301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36301

Test Plan: Imported from OSS

Differential Revision: D20943004

Pulled By: ezyang

fbshipit-source-id: 4f4d3a5be40f3caedea94fab11d7c7810913ddc1
2020-04-20 14:14:43 -07:00
47023148ee Convert C casts to static casts (UnaryOpsKernel) (#36400)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36400

Differential Revision: D21067974

Pulled By: ezyang

fbshipit-source-id: 6b99feef349fb97f3b459eab805a7f3d923f9f06
2020-04-20 14:09:38 -07:00
a2951a1ea1 [quant][graph] Update quantize_dynamic_script API to take sample model args (#36817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36817

For dynamic quant we need to run the observers on the weights to calculate the qparams before calling convert on the model.
The API requires the user to provide dummy inputs that will be fed to the model after the prepare step to run the observers

Test Plan:
test_quanitze_script.py
test_quantization.py

Imported from OSS

Differential Revision: D21134439

fbshipit-source-id: 8acaab4eb57aadb68a2a02cc470bb042c07e1f6b
2020-04-20 13:45:51 -07:00
63e9d95c12 Remove hacked twins from codegen (#36666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36666

We need to introduce hacked twin overloads for ops taking lists of optional tensors.
I'm not really sure why, actually, but their being a special case in codegen blocks the removal of `use_c10_dispatcher: unboxed_only`.
This PR does not remove the "hacked twin" hack, but it removes it from codegen, instead manually specifying the overloads in register_prim_ops.cpp and unblocking the removal of `use_c10_dispatcher: unboxed_only`.

Original commit changeset: c5e2386ad06a
ghstack-source-id: 102507901

Test Plan: waitforsandcastle

Differential Revision: D21044962

fbshipit-source-id: 9d423aac08a1dd2bab54940ccb6219ebdcb7d230
2020-04-20 13:40:14 -07:00
4e365b9cd1 [Distribution] Implement kl divergence for Cauchy distribution (#36477)
Summary:
Implement closed-form KL divergence between Cauchy distributions.

### Reference:
https://arxiv.org/pdf/1905.10965.pdf
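A minimal sketch checking the registered KL against the closed form from the referenced paper (the formula in the comment is my reading of that result, not quoted from the PR):
```python
import math
import torch
from torch.distributions import Cauchy, kl_divergence

p = Cauchy(loc=0.0, scale=1.0)
q = Cauchy(loc=1.0, scale=2.0)

# KL(p || q) = log(((s_p + s_q)**2 + (l_p - l_q)**2) / (4 * s_p * s_q))
expected = math.log(((1.0 + 2.0) ** 2 + (0.0 - 1.0) ** 2) / (4 * 1.0 * 2.0))
assert torch.isclose(kl_divergence(p, q), torch.tensor(expected))
```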
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36477

Differential Revision: D21134487

Pulled By: ezyang

fbshipit-source-id: 69d2cc2237aa931f224c3807baee7c63f91583fc
2020-04-20 13:27:11 -07:00
4d2502a0c2 fix explicitly defaulted constexpr assignment operator fails to compile error for gcc 5.3.0 (#36561)
Summary:
gcc 5.3.0 has a bug where an explicitly defaulted function cannot be defined as `constexpr`; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68754. To make the build work with gcc 5.3.0, we no longer define the defaulted assignment operator as a `constexpr` function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36561

Differential Revision: D21024109

Pulled By: ezyang

fbshipit-source-id: 58fce704625b7d0926e40b6b12841ebbe392c59c
2020-04-20 13:19:10 -07:00
e0e70589ef [quant][graphmode] tanh pattern and test (#36880)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36880

Test Plan: Imported from OSS

Differential Revision: D21110105

Pulled By: z-a-f

fbshipit-source-id: 45b33ec203c333c1b40376bcf768f2a0eaa8cc5e
2020-04-20 12:44:38 -07:00
752d3c281a [profiler] Allow record_function ctx manager to profile futures (#35055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055

This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future.

This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future.

Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well.

These code paths will be merged once the futures are merged.
ghstack-source-id: 102478180

Test Plan: Added unit tests

Differential Revision: D20452003

fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d
2020-04-20 12:37:54 -07:00
1e054bfbdc Report error for lt, le, gt, ge in complex Vec256 (consistent with <, <=, >, >=) (#36646)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36646

Differential Revision: D21089498

Pulled By: ezyang

fbshipit-source-id: 8df33f8ef7070eea6132f355e507f479f6ca6080
2020-04-20 12:23:12 -07:00
dc4d888193 ROCm: don't warn about CUDA compute capabilities (#35949)
Summary:
The warning doesn't apply for ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35949

Differential Revision: D21050540

Pulled By: ezyang

fbshipit-source-id: 23b13bddd13f132c2017ddea12b2dc54f611fba6
2020-04-20 11:53:56 -07:00
2fa17dedac add a fast path for EmbeddingBag calling FBGEMM (#36679)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36679

Test Plan:
Imported from OSS

Unit tests:
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_failures_cpu
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_offsets_cpu
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
python test/run_test.py -i test_nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_no_offsets_cpu
python test/test_nn.py TestNN.test_embeddingbag_from_pretrained
python test/test_nn.py TestNN.test_embeddingbag_from_pretrained_options

Finally run: python test/test_nn.py

Reviewed By: supriyar

Differential Revision: D21058034

Pulled By: xing-liu

fbshipit-source-id: 8fef39078132f63c406976d6b76c51f9ce573f90
2020-04-20 11:39:16 -07:00
68f847c4c6 [rpc] Remove redundant call to createExceptionResponse (#36857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36857

This was a redundant call, as we immediately took the msg and converted
it back to a string
ghstack-source-id: 102424018

Test Plan: CI

Differential Revision: D21104235

fbshipit-source-id: 4124007d800dbe2718ddebb40281d0a86484685e
2020-04-20 11:30:37 -07:00
399f494d22 Add at::aten::hardsigmoid symbol (#36851)
Summary:
This will allow XLA to use this symbol when lowering hardsigmoid
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36851

Differential Revision: D21102827

Pulled By: ailzhang

fbshipit-source-id: 99429a40a61ba84eb38b872cb3656aa5a172b03b
2020-04-20 11:17:20 -07:00
13391cebe2 ai-pep: match the qlinear benchmark to linear (#36674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674

Slight changes to qlinear benchmark to have it be in the same format
as linear, for fairer comparisons between FP and Q.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```

Imported from OSS

Differential Revision: D21102562

fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
2020-04-20 09:46:32 -07:00
25649684ed ai-pep: align qconv benchmark to conv (#36673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673

Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can compare across the two better.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```

Imported from OSS

Differential Revision: D21102563

fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
2020-04-20 09:44:09 -07:00
c7cf4c1bd6 Bmm sparse dense (#33430)
Summary:
Add sparse-dense BMM operation for CUDA and CPU.

Closes https://github.com/pytorch/pytorch/issues/5672
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33430
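
A minimal sketch of the new capability (the shapes are arbitrary):

```python
import torch

a = torch.randn(2, 3, 4).to_sparse()  # batched sparse (COO) operand
b = torch.randn(2, 4, 5)              # dense operand
out = torch.bmm(a, b)                 # dense result of shape (2, 3, 5)
```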

Differential Revision: D21017828

Pulled By: ezyang

fbshipit-source-id: 5bf60efcb16d05c08c7a284accc04d8968f98752
2020-04-20 09:35:16 -07:00
30e7055ed7 Revert D21078446: [pytorch] Route default warning sync to LOG(WARNING)
Test Plan: revert-hammer

Differential Revision:
D21078446

Original commit changeset: b5d36aac54d6

fbshipit-source-id: adff2d7e396b2efdd29eeabfe393fbc55edbe635
2020-04-20 00:26:56 -07:00
0f0d69009e Makes CUDA -float->uint8 cast consistent with CPU (#36832)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/36807. Also updates the cast testing to catch issues like this better.

In the future a more constexpr based approach to casting would be nice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36832
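
A minimal consistency check in the spirit of the fix (assumes a CUDA device is available; the sample values are arbitrary):

```python
import torch

a = torch.tensor([-0.5, 1.5, 255.7])
cpu = a.to(torch.uint8)
gpu = a.cuda().to(torch.uint8).cpu()
assert torch.equal(cpu, gpu)  # CUDA float -> uint8 casts now match CPU
```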

Differential Revision: D21120822

Pulled By: mruberry

fbshipit-source-id: 9504ddd36cfe6d9f9f545fc277fef36855c1b221
2020-04-19 23:33:38 -07:00
9d5dda7c2f [pytorch] Route default warning sync to LOG(WARNING) (#36768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36768

Follow the LOG(WARNING) format for C++-side warnings in order to play well with larger services, especially when using glog. I need to hook into glog internals a bit in order to override FILE/LINE without having to change the whole thing to macros, but it seems to be stable between glog versions.

Note, this also changes caffe2_log_level to warning by default - I think it's a much better default when compiling without glog (or maybe it should even be info)

Test Plan:
Run unittest in both glog and non-glog build mode:

glog:
```
W0416 12:06:49.778215 3311666 exception_test.cpp:23] Warning: I'm a warning (function TestBody)
```

no-glog:
```
[W exception_test.cpp:23] Warning: I'm a warning (function TestBody)
```

Reviewed By: ilia-cher

Differential Revision: D21078446

fbshipit-source-id: b5d36aac54d6b6295a72de6754696ccafbcb84ca
2020-04-19 23:02:55 -07:00
49457a7be7 Logging for ATen op subtype
Summary: ATenOp should go away, but before it does it's important to understand what's going on inside of it. We already log `arguments`, but it's rather hard to parse in Scuba as it's a list, not a dictionary. Let's extract the operator name explicitly so that grouping works well

Test Plan: unittest

Reviewed By: ngimel

Differential Revision: D21057966

fbshipit-source-id: 86be7cca39055620477a28bd5d8ab29e8edd2ff9
2020-04-19 23:02:50 -07:00
246e9abf3f Backward-compatible workaround for ATenOp index with dtype=uint8 (#36667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36667

Hacky workaround that would allow us to reland https://github.com/pytorch/pytorch/pull/34418

Basically moves the type conversion into ATenOp wrapper that is still used in some models.

Test Plan: Added unittest. Before it was producing warnings about wrong dtype, with this fix it doesn't

Reviewed By: ngimel

Differential Revision: D21037368

fbshipit-source-id: 06b435525d8d182c7607e33fd745060d3d6869e9
2020-04-19 23:00:58 -07:00
60c3060621 Remove CUDA9Workarounds.cuh (#36840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36840

Differential Revision: D21121709

Pulled By: ngimel

fbshipit-source-id: d9319de2511ca660869ec2eafdc1b6edddcf2f51
2020-04-19 22:49:49 -07:00
3c55b5a8ef Update persons_of_interest.rst 2020-04-19 20:26:02 -07:00
1341ea4802 Fix MaxPool3d CUDA backward incorrect results for non-square output (#36820)
Summary:
In the CUDA version of max_pool3d backward, the function `max_pool3d_with_indices_backward_out_frame` is defined with args as `..., oheight, owidth, ...` but called with `..., owidth, oheight, ...`. As a result, gradients are not fully calculated along the longer dimension due to insufficient grid size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36820
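
A minimal repro sketch (assumes a CUDA device; the shapes are illustrative, chosen so that output height != width):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 16, 8, device="cuda", requires_grad=True)
out = F.max_pool3d(x, kernel_size=2)  # non-square spatial output
out.sum().backward()                  # grads along the longer dim were previously incomplete
```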

Differential Revision: D21120078

Pulled By: ngimel

fbshipit-source-id: d061726647a4a45d45d5c1a00f2f1cf2745726a8
2020-04-19 18:05:02 -07:00
1b3741aa7f [WIP] reenable bfloat16 masked_select (#36859)
Summary:
Try re-enabling bfloat16 masked_select and see if the Windows tests pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36859

Differential Revision: D21109535

Pulled By: ngimel

fbshipit-source-id: ca260943e6575d8e788e9fd87161a0d40d3d44fb
2020-04-19 15:41:32 -07:00
be9748f226 Minor tweak of FakeLowp CMakefile (#36861)
Summary:
On some machines I found errors like "cannot find `cpuinfo.h`" when building FakeLowp ops. This fixes that, and also updates the README.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36861

Reviewed By: amylittleyang

Differential Revision: D21105755

Pulled By: yinghai

fbshipit-source-id: 4f17bd969d38e1b2945b8753ffe4bdc703de36bf
2020-04-19 15:02:26 -07:00
49b10c58a3 Revert D20896697: [pytorch][PR] QuantizedCUDA implementation
Test Plan: revert-hammer

Differential Revision:
D20896697

Original commit changeset: 163554efa23d

fbshipit-source-id: e3e370ef7c8be68ea34368dfcc7a7efc9d1f8761
2020-04-19 12:41:51 -07:00
f6daa6220e QuantizedCUDA implementation (#35463)
Summary:
Closes https://github.com/pytorch/pytorch/issues/30813

1. Tensor quantization logic (quantize_*) is moved to aten/native/quantized. Previously all logic for tensor quantization lived in the aten/quantized/Quantizer.cpp file, and it started to become complicated and hard to read. This problem should be addressed in a refactoring PR. Still, I reworked this partially because I had to add tensor quantization logic for CUDA, and it was natural to move everything to aten/native/quantized.
2. Requirements to run CUDA_tensor_apply* were eased to process any tensor that lives on the CUDA device (QuantizedCUDA included).
3. All quantized data types now have a default constructor. NVCC refuses to compile any gpu_kernel or CUDA_tensor_apply* without them.
4. Minor changes in many files to register QuantizedCUDA backend.
5. test_quantized_tensor is extended to process QuantizedCUDA backend where possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35463

Differential Revision: D20896697

Pulled By: jerryzh168

fbshipit-source-id: 163554efa23d11a2b10bbc2492439db4798eb26b
2020-04-19 08:33:16 -07:00
54ed6fd3ee Use both absolute and relative tolerance in testing (#34258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258

This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.
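
A minimal sketch of the combined-tolerance check (the names and default values here are illustrative, not the test framework's actual signature):

```python
def close_enough(actual, expected, rtol=1.3e-6, atol=1e-5):
    # pass if the error is within atol plus a term that scales with the expected magnitude
    return bool(((actual - expected).abs() <= atol + rtol * expected.abs()).all())
```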

Test Plan: Imported from OSS

Differential Revision: D21110255

Pulled By: nairbv

fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
2020-04-19 06:16:49 -07:00
3aec9f7924 [AIDemos] Add missing operators for AIDemos (#36756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36756

1. Add missing ops for pytorch models used in AIDemos. This is because we're migrating towards the lite interpreter on mobile; the full JIT version will be deprecated
2. Replace the old mobilenet model with a newer one in bytecode format
3. Regenerate the reddit model to include bytecode
ghstack-source-id: 102422498

Test Plan: `buck build AIDemos:AIDemos`

Reviewed By: iseeyuan, linbinyu

Differential Revision: D21013409

fbshipit-source-id: 7704d32fccfe61a2c9db38846ce3153bb93eee7f
2020-04-19 02:09:26 -07:00
b0b9e704ed [nnpi glow unit test] SLS tests shape sweep with hypothesis testing (#36833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36833

Add hypothesis testing sweep for one test in each SLS test suite for different precisions.

Sweep random seed, embedding shape, batch_size, weight with hypothesis testing.

Refactor sls tests into proper file with precision labeled in filename.
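
A minimal sketch of the sweep pattern (parameter names, ranges, and the test body are illustrative):

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=20, deadline=None)
@given(
    seed=st.integers(0, 2**31 - 1),
    batch_size=st.integers(1, 32),
    num_rows=st.integers(1, 1000),
    embedding_dim=st.sampled_from([16, 32, 64]),
)
def test_slws(seed, batch_size, num_rows, embedding_dim):
    ...  # build the SLS net and compare fakelowp vs. reference outputs
```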

Test Plan:
FB intern: buck test mode/dev //caffe2/caffe2/contrib/fakelowp/test:test_sls_8bit_nnpi_fp32nnpi

Will test OSS after exporting PR.

Reviewed By: yinghai

Differential Revision: D21098346

fbshipit-source-id: af167118e5289bb7178ffc27aaec3af101dcd828
2020-04-18 20:43:25 -07:00
8b685a8af0 C++ make constructor NamedAnyModule(name,any) public (#36869)
Summary:
Allows creation of _NamedAnyModule_ directly from _AnyModule_, e.g.

```
  auto a=torch::nn::AnyModule(torch::nn::Linear(1,2));
  auto m=torch::nn::NamedAnyModule("fc", a);
```
Without the public constructor, it would be necessary to recast the AnyModule to its underlying type,
then have the constructor cast it back to AnyModule.

With the public AnyModule constructor,
it is possible to do
```
auto q=Sequential({m});
```
or
```
q->push_back(m.name, m.module());
```

(works in conjunction with PR https://github.com/pytorch/pytorch/issues/36720 which allowed adding _AnyModule_ directly)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36869

Differential Revision: D21110074

Pulled By: yf225

fbshipit-source-id: aaea02282b9024824785e54d8732c0a12c850977
2020-04-18 20:08:15 -07:00
6ba734bae9 Vectorize reduction when reducing on fastest striding dimension [resubmit] (#36873)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36873

Differential Revision: D21109194

Pulled By: ngimel

fbshipit-source-id: eb18c6b4394f19a6c5eca45ef4ce97d623e051bd
2020-04-18 16:27:00 -07:00
136d84dd38 Enhance error message for MPI unavailability. (#36781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36781

Mention that you need to build PyTorch from source to enable MPI.
Additional context:
https://discuss.pytorch.org/t/distributed-pytorch-with-mpi/77106.
ghstack-source-id: 102341246

Test Plan: waitforbuildbot

Differential Revision: D21082009

fbshipit-source-id: 3a3286349e71322726a341dfc743b5978c7d9a56
2020-04-18 14:45:44 -07:00
d933ec14ce [c10] Fix the handling for Caffe2 ops which return a tensor list (#36841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36841

right now, all c2 ops' outputs will be unwrapped blindly. This is not correct if we have a single tensor list returned.

Test Plan: buck test mode/dev-nosan mode/no-gpu //caffe2/caffe2/fb/python/operator_test:torch_integration_test

Reviewed By: alyssawangqq

Differential Revision: D21100463

fbshipit-source-id: 9f22f3ddf029e7da9d98008d68820bf7f8239d4f
2020-04-18 13:30:43 -07:00
0e6c66493a [engine] Ensure future is complete when exiting Engine::mark_graph_task_completed() (#36856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36856

Previously, we could early-exit mark_graph_task_completed() without the future
actually being fully complete - we were only guaranteeing that it was at least
in the process of being marked complete.

This seems to be triggering an assert graph_task->future_result_->completed()

This change simply adds a 1-line waitNoThrow() call to ensure that the future
has been marked complete before exiting the mark_graph_task_completed() function.
The cost is relatively reasonable, since this isn't the common path.
ghstack-source-id: 102423589

Test Plan: buck test mode/dev-nosan caffe2/test/,,,

Differential Revision: D21104121

fbshipit-source-id: 51c1554618880fe80d52d5eb96716abc15f6be8a
2020-04-18 07:47:09 -07:00
5d9b4d5720 Update contribution_guide.rst (#36438)
Summary:
Fix formatting: change "Frequently Asked Questions" into an RST header, which is clickable and lets one get a URL to the FAQ section
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36438

Differential Revision: D21106180

Pulled By: mruberry

fbshipit-source-id: 370dafd1883bd57285b478cf2faa14ae2f86e3ba
2020-04-18 02:27:38 -07:00
2e93808cde Update functional.py (#36600)
Summary:
Fix a latex typo in the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36600

Differential Revision: D21106164

Pulled By: mruberry

fbshipit-source-id: b611f0eac209b59b06ace1017e418a68341c4105
2020-04-18 02:16:54 -07:00
197c85fcbc Use hypothesis to generate seed (#36860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36860

att

Test Plan:
```
buck test mode/dev //caffe2/caffe2/contrib/fakelowp/test:test_batchnorm_nnpi_fp16nnpi -- 'test_bn \(caffe2\.caffe2\.contrib\.fakelowp\.test\.test_batchnorm_nnpi_fp16\.BatchnormTest\)'
```

Reviewed By: hyuen

Differential Revision: D21085268

fbshipit-source-id: becc25d6e8841dc25842d9397ecf500f4da6d2f7
2020-04-17 22:15:53 -07:00
57c50db441 [reland][quant] Add backward compatiblity test (#36842)
Summary:
re-created the same PR: https://github.com/pytorch/pytorch/pull/36639
because ghimport does not support importing binary files right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36842

Test Plan: python test/quantization/test_backward_compatibility.py

Differential Revision: D21100689

Pulled By: jerryzh168

fbshipit-source-id: 625a0f9da98138c9c2891b9d99fc45d85fa27cca
2020-04-17 21:24:31 -07:00
b245b1d23e Open source fbgemm fp16 pack op (#36791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36791

This should enable `test_fc_nnpi_fp16.py` test in fakelowp test.

Test Plan:
```
buck test caffe2/caffe2/fb/fbgemm:
```

Reviewed By: hyuen

Differential Revision: D21085221

fbshipit-source-id: 512bca2eea1a4cc5d11129cfe9e65e7a4a0ba1e0
2020-04-17 21:00:52 -07:00
1b1a6a90c0 Open source fakefp16 BatchMatMul op (#36789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36789

ATT. This should enable test_batchmatmul_nnpi_fp16.py test.

Test Plan:
```
buck test caffe2/caffe2/fb/fbgemm:
```

Reviewed By: hyuen

Differential Revision: D21084876

fbshipit-source-id: d2b4a4c44ad5cf454cefbe6a4cafdf110c6d35cb
2020-04-17 21:00:47 -07:00
b08494eb19 Use hypothesis to control the rand seed (#36717)
Summary:
This will increase the test coverage.

More tests to enable:
```
test_fc_nnpi_fp16.py
test_op_nnpi_fp16.py
test_batchmatmul_nnpi_fp16.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36717

Reviewed By: hyuen

Differential Revision: D21061269

Pulled By: yinghai

fbshipit-source-id: 99277db8ff23f0f4e679f7b8955cfa305916e7a7
2020-04-17 20:59:15 -07:00
dc1f9eee53 Avoid printing erroneous warning about "MIOpen not found" for ROCm builds (#33837)
Summary:
Older versions of MIOpen (<=2.2) don't have the `miopenGetVersion` api, but MIOpen is always a part of the ROCm builds, so do NOT set `lib` to None for ROCm builds. `__cudnn_version` will be `None` for older versions of MIOpen.

Setting `lib` to `None` ends up printing the following erroneous warning when running unit tests:
```
/root/.local/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py:120: UserWarning: cuDNN/MIOpen library not found. Check your LD_LIBRARY_PATH
  }.get(sys.platform, 'LD_LIBRARY_PATH')))
```
Eg.: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/18387/consoleFull
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33837

Differential Revision: D20369285

Pulled By: xw285cornell

fbshipit-source-id: e82e6f8f5bccb486213cf868f40aece41ce11f98
2020-04-17 20:31:01 -07:00
6963973d5b Print GPU info for ROCm test runs (#36827)
Summary:
Printing the GPU device info for ROCm test runs could aid in triaging device-specific issues. Printing gfx generation and device Name for now.
Sample output:
```
  Marketing Name:          AMD EPYC 7702 64-Core Processor
  Marketing Name:          AMD EPYC 7702 64-Core Processor
  Name:                    gfx906
  Marketing Name:          Device 66a1
  Name:                    gfx906
  Marketing Name:          Device 66a1
  Name:                    gfx906
  Marketing Name:          Device 66a1
  Name:                    gfx906
  Marketing Name:          Device 66a1
  Name:                    gfx906
  Marketing Name:          Device 66a1
  Name:                    gfx906
  Marketing Name:          Device 66a1
  Name:                    gfx906
  Marketing Name:          Device 66a1
```
cc iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36827

Differential Revision: D21104532

Pulled By: xw285cornell

fbshipit-source-id: 167ce6b762e7f85d22addad755bfdcde8d111788
2020-04-17 20:13:14 -07:00
a64ea8ea04 Back out "Vectorize reduction when reducing on fastest striding dimension" (#36854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36854

Original commit changeset: ea3f7f29709c

Test Plan: n/a

Differential Revision: D21103684

fbshipit-source-id: e4862b32bf9815486e5fa7e05b9816550e9b0263
2020-04-17 19:53:30 -07:00
681158e211 Print all test output while running unit tests in bazel (#36825)
Summary:
Closes https://github.com/pytorch/pytorch/issues/36595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36825

Test Plan: CI

Differential Revision: D21100064

Pulled By: malfet

fbshipit-source-id: 7aef30bc5c4dc9cfdaaf10c7aa9888647e4a3ef5
2020-04-17 18:06:58 -07:00
f767de608c [tensorboard] Add strings to image boxes (#30941)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/27300

sample usage:
```python
import torch
from torch.utils.tensorboard import SummaryWriter
with SummaryWriter() as w:
    w.add_image_with_boxes('imagebox_label', torch.ones(3, 240, 240) * 0.5,
                           torch.Tensor([[10, 10, 100, 100], [101, 101, 200, 200]]),
                           global_step=0, labels=['label1', 'label2'])
```
![image](https://user-images.githubusercontent.com/2005323/70387144-53580b80-19dc-11ea-91a1-9275de13ca79.png)

cc sanekmelnikov orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30941

Differential Revision: D21083617

Pulled By: natalialunova

fbshipit-source-id: b451b701159eecc0ea0bece96ec69f69f5432791
2020-04-17 17:58:43 -07:00
4668d47d1f Add build_variable.bzl to CMAKE_RERUN target (#36809)
Summary:
The `configure_file` command adds its input as a top-level dependency, triggering makefile regeneration if the file timestamp has changed.
Also abort CMake if `exec` of build_variables.bzl fails for some reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36809

Test Plan: Add invalid statement to build_variables.bzl and check that build process fails

Differential Revision: D21100721

Pulled By: malfet

fbshipit-source-id: 79a54aa367fb8dedb269c78b9538b4da203d856b
2020-04-17 17:28:07 -07:00
86f354c530 Python binding api to optimize for mobile model on script module. (#36357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36357
ghstack-source-id: 101907180

Creating a Python API entry point to optimize a mobile model; it takes a scripted module as an argument and returns an optimized scripted module. The initial optimization features include inserting and folding prepack ops.
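
A minimal usage sketch (`MyModel` is a placeholder for your own module):

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

scripted = torch.jit.script(MyModel().eval())
optimized = optimize_for_mobile(scripted)  # returns a new, optimized ScriptModule
torch.jit.save(optimized, "model_opt.pt")
```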

Test Plan: python test/test_optimizer.py

Differential Revision: D20946076

fbshipit-source-id: 93cb4a5bb2371128f802d738eb26d0a4f3b2fe10
2020-04-17 16:21:27 -07:00
fac076a82c [pytorch] move prim::TupleIndex from register_prim_ops_fulljit to register_prim_ops (#36808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36808

Trying to run a model on mobile and prim::TupleIndex is reported as missing.  Moving it out from fulljit.

Reviewed By: linbinyu

Differential Revision: D21065879

fbshipit-source-id: d7a6dc9e5ad306d76825eaef815ab5582d4bf9a1
2020-04-17 16:19:25 -07:00
0a8a012005 [RELAND] Port to quantized and other operators to new registration API (#36799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36799

This is a roll up of a bunch of small PRs for ease of landing.

- Update reference to RegisterOperators in error message in Convolution.
- Add explicit schema for quantized conv/conv_prepack (fixes #36511)
- Add a centralized TORCH_LIBRARY declaration for quantized and xnnpack ops (fixes #36510)
- Port to new registration API:
  - Resize
  - detach/detach_
  - All quantization operators
- Update quantized README for registering operators with new API

Test Plan: Imported from OSS

Differential Revision: D21089649

Pulled By: ezyang

fbshipit-source-id: 3dd205c2c075f6a3d67aadb2b96af25512e7acd0
2020-04-17 15:46:23 -07:00
d7608c7f56 Move DICT ops to lite interpreter (#36816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36816

Move all DICT ops to lite interpreter

Test Plan: build

Reviewed By: iseeyuan

Differential Revision: D21085335

fbshipit-source-id: dd33e5846cea699cadb1801af5dbff68b7f542a0
2020-04-17 15:24:19 -07:00
cc5befc461 [Format] format a few files (#35187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35187

When I touch these files, lint will always introduce some unintended changes; to prevent that from happening, we need to format the code first.
The change is generated by:
  arc f

Test Plan: integration test.

Differential Revision: D20587596

fbshipit-source-id: 512cf6b86bd6632a61c80ed53e3a9e229feecc2a
2020-04-17 14:30:01 -07:00
54a1e8509c Reduce binary size of schema inference (#34735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34735

ghstack-source-id: 102255215

Test Plan: testinprod

Differential Revision: D20448172

fbshipit-source-id: 1e5580c9e7eb3626420fc1a06acef072586438a7
2020-04-17 13:36:40 -07:00
3a400b8dc3 [tensorboard] Fix function input parameter for add_hparams (#31301)
Summary:
closes https://github.com/pytorch/pytorch/issues/30943

both parameters in add_hparams are mandatory.

cc sanekmelnikov orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31301

Differential Revision: D21069761

Pulled By: natalialunova

fbshipit-source-id: 1a12f6760fa9676d11901fa85aa91f4f9fff976d
2020-04-17 11:19:14 -07:00
d92005ff73 Vectorize reduction when reducing on fastest striding dimension (#36709)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36709

Test Plan: Imported from OSS

Differential Revision: D21083393

Pulled By: ngimel

fbshipit-source-id: ea3f7f29709c9a6e5b3ec45ba809cb2cf6c5e0c8
2020-04-17 10:12:49 -07:00
e6bc34f549 Amp gradient accumulation example (#36601)
Summary:
Several people have asked me about proper Amp usage with gradient accumulation.  In particular, it's [unclear to people](https://github.com/NVIDIA/apex/issues/439#issuecomment-610351482) that you should only call `scaler.unscale_()` (if desired) and `scaler.update()` in iterations where you actually plan to step.  This PR adds a minimal accumulation example.

I built the docs locally and it looks free from sphinx errors, at least.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36601
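
A condensed sketch of the pattern (`model`, `optimizer`, `data`, `loss_fn`, and `iters_to_accumulate` are assumed to be defined elsewhere):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
for i, (input, target) in enumerate(data):
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(input), target) / iters_to_accumulate
    scaler.scale(loss).backward()
    if (i + 1) % iters_to_accumulate == 0:
        # only call scaler.step()/update() (and unscale_(), if desired)
        # in iterations where we actually step
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```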

Differential Revision: D21082295

Pulled By: ngimel

fbshipit-source-id: b2faa6c02b9f7e1972618a0f1d5360a03f0450ac
2020-04-17 09:56:36 -07:00
adca88e821 Fix hardsigmoid/hardswish for proper device dispatch. (#36704)
Summary:
https://github.com/pytorch/pytorch/pull/36351 made `hardsigmoid_backward` use TensorIterator, but that can be done only after proper device dispatch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36704

Differential Revision: D21068126

Pulled By: ailzhang

fbshipit-source-id: 6a6a74216f2b50fa7d15f692cd1583d3d233580a
2020-04-17 09:42:09 -07:00
9df9aef9b9 [ROCm] Use float datatype for RNN test for MIOpen (#36772)
Summary:
This pull request changes the datatype for `test_RNN_cpu_vs_cudnn_no_dropout` on ROCm testing to float.
Currently MIOpen RNN does not support the double datatype, so using only double would not run this test through MIOpen. To correctly test the PyTorch RNN operator using MIOpen, we need to test it using float tensors and modules.
The changes in this PR addresses the comments in https://github.com/pytorch/pytorch/issues/34615

ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36772

Differential Revision: D21089533

Pulled By: ezyang

fbshipit-source-id: b5781e4ca270d64c6b949b3f0436e7b4eb870e27
2020-04-17 09:14:06 -07:00
4c666d42ff Handle log_sigmoid(out=) properly. (#36736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36736

Fixes: https://github.com/pytorch/pytorch/issues/36499

Changes:
1) Moves some bindings from LegacyNNDefinitions to Activation so all of log_sigmoid lives together
2) Properly handle non-contiguous / incorrectly sized out parameters to log_sigmoid.  This is done by copying from a buffer if necessary.
3) Require that the internal buffer (different from 2)) is contiguous.  This should always be the case because it's always created internally.
4) Adds a test

Test Plan: Imported from OSS

Differential Revision: D21070934

Pulled By: gchanan

fbshipit-source-id: 94577313c32d1ef04d65c1d6657598304a39fe6e
2020-04-17 08:27:57 -07:00
46288465fe Print keyword-only arg symbol for function signature suggestions. (#36780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36780

Fixes: https://github.com/pytorch/pytorch/issues/36773

Test Plan: Imported from OSS

Differential Revision: D21081993

Pulled By: gchanan

fbshipit-source-id: 624b0077f88208aafa131ab7b3e5f1fe9dd70987
2020-04-17 07:30:46 -07:00
ebdc4f02ad Fix incorrect merge of #34136. (#36760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36760

If you look at https://github.com/pytorch/pytorch/pull/34136/, you will notice a commit (80c15c087c) that didn't get merged.
This is to address that, to avoid crashing on remainder when the rhs is 0.

Test Plan: Imported from OSS

Differential Revision: D21078776

Pulled By: gchanan

fbshipit-source-id: 0ac138cbafac28cf8d696a2a413d3c542138cff9
2020-04-17 07:25:13 -07:00
32bbf12aa7 Make trivial thread-idx for degenerate statements without thread-idx. (#36480)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36480

Test Plan: Imported from OSS

Differential Revision: D20992505

Pulled By: zheng-xq

fbshipit-source-id: 3d4e5401b59b9507b5f2db659e511bd1af53f5ab
2020-04-17 02:31:07 -07:00
31f91d645a Improve aten::backward handling (#36750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36750

- It seems the JIT schema for aten::backward and the schema in native_functions.yaml diverged on whether the retain_graph/keep_graph parameter takes a `bool` or a `bool?`. Make them identical again.
- Also remove the mutability annotation for the self parameter. This does not make sense together with AliasAnalysisKind::CONSERVATIVE and it triggered an assertion
- After fixing the mutability annotation, we can fix that assertion so that it doesn't exclude aten::backward from its check anymore
- Last but not least, remove the unboxed_only marker from aten::backward. This requires us to add a special case in register_c10_ops.cpp for it, because JIT still has its own implementation.
ghstack-source-id: 102351871

Test Plan: waitforsandcastle

Differential Revision: D21004102

fbshipit-source-id: 19bd1adbd8103c214d32e5126671a809adec581e
2020-04-17 00:49:49 -07:00
b45b9673a1 Fixes clang format (#36787)
Summary:
Fixes clang format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36787

Differential Revision: D21084603

Pulled By: mruberry

fbshipit-source-id: 7e29da135f9a2aa126cb68640e33c1914fd570e3
2020-04-17 00:42:51 -07:00
a89d1ed549 Move unboxing for factory ops to after dispatch (#36564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36564

-
ghstack-source-id: 102292162

Test Plan: waitforsandcastle

Differential Revision: D21014324

fbshipit-source-id: 5b95eafbb668bed174cd2c826e308e74e329f552
2020-04-16 22:44:00 -07:00
b5483b8286 [pytorch][PR] Re-enable a failing test (#36763)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36763

Differential Revision: D21083309

Pulled By: Krovatkin

fbshipit-source-id: 4fb5b95bd3e01bd83a406d4394f266d7fd168f21
2020-04-16 22:21:47 -07:00
f00014b790 Revert D21080503: [pytorch][PR] [quant] Add backward compatiblity test
Test Plan: revert-hammer

Differential Revision:
D21080503

Original commit changeset: 1dca08208bcc

fbshipit-source-id: 5cd8c22130ff28b9231f657f80961e94b65b5792
2020-04-16 22:03:12 -07:00
b0227f2965 Add a test to verify non-contiguous tensors work correctly with RPC. (#36705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36705

ghstack-source-id: 102257937

Test Plan: waitforbuildbot

Differential Revision: D21058176

fbshipit-source-id: 1d32730d61420324856cc641f888751418c1dd26
2020-04-16 21:56:36 -07:00
eccb40f505 Optimize mobile model on cloned module instead of in-place transformation (#36621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36621

Instead of doing an in-place transformation inside the optimizeForMobile method,
we would like to maintain the original structure of the passed scripted module.
So before optimization starts, we clone the module, run the subsequent optimization
process on the clone, and return the optimized cloned module.

Test Plan:
unit test
python test/test_mobile_optimizer.py

Imported from OSS

Differential Revision: D21028406

fbshipit-source-id: 756172ef99b1c1df6bb7d00e5deca85a4c239a87
2020-04-16 21:49:18 -07:00
76f9528878 fix an infinite loop in liveness (#36697)
Summary:
This should fix https://github.com/pytorch/pytorch/issues/36434

We create new nodes to insert explicit uses of loop counters while doing liveness analysis. This was totally fine when we had a two-pass liveness (since we don't really care about liveness sets for those nodes), but with the fixed-point algorithm we can *never* achieve the fixed point because the initial liveness sets for these new nodes start empty and we always add some live values to those sets thus `changed_` is always set `true`.
Now it's amazing that this didn't get exposed and worked for such a long time! Apparently, when we destroyed and recreated those new nodes they were allocated at the exact same addresses in the memory!!!!!! And we use those addresses as keys to get liveness sets, so these new nodes **inherited** the liveness sets!
I was still a bit sceptical of this explanation, so I added more tracing to liveness analysis and AFAICT this is exactly how we were able to get away with this bug for such a long time!!!

Here's a few excerpts from the trace.

Before we enter a loop we create a node to use loop's upper bound.

```
[DEBUG liveness.cpp:121] @#$Creating a store for mtc : 0x555777c19eb0
```

When processing the loop, we also process this node. Its liveness sets are empty!
```
[DEBUG liveness.cpp:099] Processing node  = prim::Store(%3) addr = 0x555777c19eb0
[DEBUG liveness.cpp:148] @#$liveness_sets_[it] : {}
```

We are done with this loop. We remove the node we added
```
[DEBUG liveness.cpp:127] @#$Destroying a store for ctc : 0x555777c19eb0
```

We are about to process the loop for the second time, so we create the use node again.
Note, it's allocated at the exact same address!!!
```
[DEBUG liveness.cpp:118] @#$Creating a store for ctc : 0x555777c19eb0
```

Now we process it again. But now it has non-empty sets even though it's a brand new node!!!!

```
[DEBUG liveness.cpp:099] Processing node  = prim::Store(%i) addr = 0x555777c19eb0
[DEBUG liveness.cpp:148] @#$liveness_sets_[it] : {2, 3, 10}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36697

Differential Revision: D21059313

Pulled By: Krovatkin

fbshipit-source-id: b0fdeb4418e0e73f34187826877179260f21cf7b
2020-04-16 20:50:01 -07:00
6d4c509168 [autograd] lower MAX_DEPTH limit according to TSAN limit (#36745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36745

As we hold a mutex for our custom C++ Node, when calling reentrant
backward from a custom C++ function we will concurrently hold many
mutexes, up to MAX_DEPTH. TSAN only allows 65 mutexes at once; otherwise
it will complain. This PR lowers the limit according to TSAN.

TSAN Reference: https://github.com/google/sanitizers/issues/950

Test Plan: Imported from OSS

Differential Revision: D21072604

Pulled By: wanchaol

fbshipit-source-id: 99cd1acab41a203d834fa4947f4e6f0ffd2e70f2
2020-04-16 20:43:20 -07:00
d7fc05b0bf Fetch TORCH_SRCS from build_variables.bzl (#36737)
Summary:
Mimic `.bzl` parsing logic from https://github.com/pytorch/FBGEMM/pull/344
Generate `libtorch_cmake_sources` by running following script:
```
def read_file(path):
    with open(path) as f:
        return f.read()

def get_cmake_torch_srcs():
    caffe2_cmake = read_file("caffe2/CMakeLists.txt")
    start = caffe2_cmake.find("set(TORCH_SRCS")
    end = caffe2_cmake.find(")", start)
    return caffe2_cmake[start:end+1]
def get_cmake_torch_srcs_list():
    # reuse the already-fetched block instead of re-reading the file
    caffe2_torch_srcs = get_cmake_torch_srcs()
    unfiltered_list = [x.strip() for x in caffe2_torch_srcs.split("\n") if len(x.strip()) > 0]
    return [x.replace("${TORCH_SRC_DIR}/", "torch/") for x in unfiltered_list if 'TORCH_SRC_DIR' in x]

import imp
build_variables = imp.load_source('build_variables', 'tools/build_variables.bzl')
libtorch_core_sources = set(build_variables.libtorch_core_sources)
caffe2_torch_srcs = set(get_cmake_torch_srcs_list())
if not libtorch_core_sources.issubset(caffe2_torch_srcs):
    print("libtorch_core_sources must be a subset of caffe2_torch_srcs")
print(sorted(caffe2_torch_srcs.difference(libtorch_core_sources)))
```

Move common files between `libtorch_cmake_sources` and `libtorch_extra_sources` to `libtorch_jit_core_sources`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36737

Test Plan: CI

Differential Revision: D21078753

Pulled By: malfet

fbshipit-source-id: f46ca48d48aa122188f028136c14687ff52629ed
2020-04-16 19:12:52 -07:00
dcfc121fd7 Enable jit trace check_trace for quantized inputs (#36740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36740

Issue #23986

Test Plan:
python test/quantization/test_quantized_nn_mods.py

Imported from OSS

Differential Revision: D21077551

fbshipit-source-id: fdd15db3284975c99b3e250a568fa94c617d21eb
2020-04-16 19:06:55 -07:00
484a00b2d3 [quant] Add backward compatiblity test (#36771)
Summary:
re-created the same PR: https://github.com/pytorch/pytorch/pull/36639
because ghimport does not support importing binary files right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36771

Test Plan: python test/quantization/test_backward_compatibility.py

Differential Revision: D21080503

Pulled By: jerryzh168

fbshipit-source-id: 1dca08208bccead60bba03e5fb5d39e1a1d7c20d
2020-04-16 19:00:30 -07:00
2c558dba3d quantized layer norm: add to static quant (#36690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36690

Adds the static quantization hook for LayerNorm

Test Plan:
```
python test/quantization/test_quantized_nn_mods.py ModuleAPITest.test_layer_norm
python test/quantization/test_quantization.py EagerModePostTrainingQuantTest.test_normalization
```

Imported from OSS

Differential Revision: D21055401

fbshipit-source-id: 188329f35359576d50ed0db5fb675ce68c28bf7d
2020-04-16 18:18:02 -07:00
24aac32171 [jit] Add dictionary as output of tracer (#36696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36696

This PR adds dictionaries as a supported output of the tracer under the strict
flag.

Test Plan: Imported from OSS

Reviewed By: houseroad

Differential Revision: D21056962

Pulled By: wanchaol

fbshipit-source-id: ace498182d636de853cf8a1efb3dc77f5d53db29
2020-04-16 18:12:38 -07:00
e1cb8577ac [jit] remove Dict iterationOrder and use insertion order (#36609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36609

This PR removes iterationOrder, which was backed by CompareKeys; we now
universally use Dict insertion order (backed by c10::Dict by default) to
match Python behavior
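
A minimal sketch of the resulting behavior:

```python
import torch

@torch.jit.script
def ordered_keys():
    d = {"b": 1, "a": 2}
    d["c"] = 3
    return d.keys()

print(ordered_keys())  # ['b', 'a', 'c'] -- insertion order, as in Python
```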

Test Plan: Imported from OSS

Reviewed By: houseroad

Differential Revision: D21056963

Pulled By: wanchaol

fbshipit-source-id: 487961c2db2cdc27461b2fbd6df91faafc6920b5
2020-04-16 18:11:15 -07:00
05bbf6afb6 Revert D20964193: Port to new registration API (part 1)
Test Plan: revert-hammer

Differential Revision:
D20964193

Original commit changeset: 27aeea01ccf5

fbshipit-source-id: b17e1342431c493055f053dd575cf24065335bf9
2020-04-16 17:54:32 -07:00
63e5058c88 Fix naming of "strides" method in TensorType (#36727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36727

Looks like this was renamed by accident in 0cbd7fa46f2

Test Plan:
Unit test.
Lint.

Differential Revision: D21076697

Pulled By: dreiss

fbshipit-source-id: dbd18cb41c7b26479984a7a7b12ad41a4c5b7658
2020-04-16 17:07:27 -07:00
753157b88e [quant][graph] Graph mode quantization support for sigmoid (#36622)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36622

Test Plan:
python test/quantization/test_quantize_script.py test_swap_dequantize_all_ops

Imported from OSS

Differential Revision: D21075255

fbshipit-source-id: 025f432215eaa8acf34d492e7722102ca053abeb
2020-04-16 16:50:54 -07:00
17c268be10 [quant][graph] Add quantized batch_norm2d_relu to graph mode (#36552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36552

Does the fusion for both inplace and non-inplace relu;
tested for functional relu as well.
Functional batch_norm is not a usual use case (since it expects the weight, bias, mean, and var), so it is not tested.

Test Plan:
test_quantize_script.py test_batch_norm2d_relu

Imported from OSS

Differential Revision: D21075253

fbshipit-source-id: 0a07ea477cab19abf1d1b0856e623b1436240da1
2020-04-16 16:49:26 -07:00
66158868d5 Update reference to RegisterOperators in error message in Convolution (#36389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36389

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20964193

Pulled By: ezyang

fbshipit-source-id: 27aeea01ccf5dfcebb8f043cde009a14dde3958e
2020-04-16 16:22:31 -07:00
1fc3556ec9 Teach the tensorexpr vectorizer to handle nested For loops. (#36467)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36467

Differential Revision: D21013179

Pulled By: resistor

fbshipit-source-id: aa4f3da58cf16934f11e0cf4252a300cbac98f21
2020-04-16 15:40:44 -07:00
e9b4580411 Revert D20839674: [pytorch][PR] Re-enable a failing test
Test Plan: revert-hammer

Differential Revision:
D20839674

Original commit changeset: 68f41610a823

fbshipit-source-id: b69ccfd49bbde566870fa53cd3fe2931721db4ea
2020-04-16 15:26:34 -07:00
37479ddf4e [caffe2] create and register child ws in pybind (#36741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36741

Create a child workspace that shares the parent workspace's blobs. Register the child workspace in the registrar to enable switching into it and feeding blobs to it alone.

Test Plan: numeric suite unit tests in stacked diff

Reviewed By: hx89

Differential Revision: D21055567

fbshipit-source-id: 374b12aef75a4c58452c271f8961ee156ce6c559
2020-04-16 14:53:55 -07:00
5b515fd034 Delete pytorch_linux_xenial_cuda9_cudnn7_py3_build (#36731)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36731

Differential Revision: D21071658

Pulled By: malfet

fbshipit-source-id: c0e072bd5316e332f5dbc818f4bc140ce8950437
2020-04-16 14:46:42 -07:00
4894cba572 Revert D19775659: [WIP] Move profiler to a dispatch wrapper
Test Plan: revert-hammer

Differential Revision:
D19775659

Original commit changeset: 5cbe5f736660

fbshipit-source-id: dcb41d2433697c5d521044a9dbc12c79f31e0929
2020-04-16 14:18:51 -07:00
ee3d046f87 [TensorExpr] Add support for Axis reordering in LoopNest (#36540)
Summary:
Adds a capability for reordering axes in the LoopNest. This was fairly straightforward except when handling Reduction initializers, which required more changes. UPDATE: actually the complicated bit was preserving the ordering of statements in the loopnest which should not be reordered.

Usage looks something like this:

```
Tensor* tensor = Compute(
    "f", {{2, "x"}, {3, "y"}}, [](const VarHandle& x, const VarHandle& y) {
      return ExprHandle(1.0f) + cast<float>(x) * x + cast<float>(y) * y;
    });
LoopNest l({tensor});

/* LoopNest looks like:
  for x in ...
    for y  in ...
       f[x,y] = 1 + x * x + y * y;
*/

auto loops = l.getLoopStmtsFor(tensor);
l.reorderAxis(tensor, loops[0], loops[1])

/* LoopNest looks like:
  for y in ...
    for x  in ...
       f[x,y] = 1 + x * x + y * y;
*/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36540

Differential Revision: D21068143

Pulled By: nickgg

fbshipit-source-id: f02c29004376df4f5a9bedff366c075772726618
2020-04-16 13:42:47 -07:00
a85c835196 [WIP] Move profiler to a dispatch wrapper (#33057)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33057

Test Plan: Imported from OSS

Differential Revision: D19775659

Pulled By: jamesr66a

fbshipit-source-id: 5cbe5f736660c8543764ef62b16550638d9ceb72
2020-04-16 13:36:37 -07:00
487dc0f961 Re-enable a failing test (#35847)
Summary:
This test was failing because caching resulted in a function with multiple execution plans rather than multiple functions with a single execution plan each, as the test writer intended.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35847

Differential Revision: D20839674

Pulled By: Krovatkin

fbshipit-source-id: 68f41610a823d94c1e744c85ac72652c741d73ae
2020-04-16 11:46:02 -07:00
3567b881a5 make sure dispatch test works on windows (#36729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36729

setenv is not available on Windows

Test Plan: CI green in ovrsource

Reviewed By: stepancheg

Differential Revision: D21067835

fbshipit-source-id: ddbc3285ef88f123dc6a200b661c48cfafc6bf00
2020-04-16 11:36:56 -07:00
cb6bebfa9b [quant][graph] Add quantized batch_norm2d support to graph mode (#36692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36692

Test Plan:
python test_quantize_script.py

Imported from OSS

Differential Revision: D21055596

fbshipit-source-id: b21ad6bb9763cd2e7b22f525a9a46e5f4d485e17
2020-04-16 11:12:51 -07:00
dd4dece68a [quant][graph] Add useQuantizable function (#36691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36691

Enables selectively inserting observers at the inputs of aten/call functions

Test Plan:
test_quantize_script.py

Imported from OSS

Differential Revision: D21055597

fbshipit-source-id: b47733b94b127d7a47b3224da7af98f0da38d30d
2020-04-16 11:11:10 -07:00
54a575c9bd [JIT] fix torch.tensor jit dtype (#36587)
Summary:
Previously we were always creating a double tensor from `torch.tensor(1.)`, whereas python eager uses the current default dtype. Fix for https://github.com/pytorch/pytorch/issues/36369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36587
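
A minimal sketch of the fixed behavior:

```python
import torch

torch.set_default_dtype(torch.float32)

@torch.jit.script
def make():
    return torch.tensor(1.)

print(make().dtype)  # torch.float32, matching eager; previously always torch.float64
```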

Differential Revision: D21043617

Pulled By: eellison

fbshipit-source-id: 38da303594f52e06941d86b6e57c4a06e7d36938
2020-04-16 10:55:49 -07:00
e29348f828 Switch to pybind11 style registration function API. (#36258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36258

Previous we had a && chaining style API.  There are some downsides to
this API:

- It's easy to forget the 'static' qualifier in front, leading to
  subtle ODR bugs.
- It is not compatible with torchbind class_ definitions, as these
  need multiple levels of chaining.  So in practice people end
  up having to define multiple static initializers, one per class.
- It's not like pybind11.
- There's no way to conveniently get the file and line number of
  the registration, as there is no macro point in the API.
- The old API doesn't really encourage people to put all of their
  definitions for a library in one place, and to give a custom
  namespace for it.  Similarly, the old API wasn't very DRY, because
  you had to keep repeating the namespace/dispatch key you
  were writing implementations for.

The new API is modeled exactly off of the PYBIND11_MODULE macro:
you write:

```
TORCH_LIBRARY(aten, m) {
  m.def("aten::add(Tensor self, Tensor other) -> Tensor");
  ...
}
```

in a non-chaining fashion, and under the hood the macro expands to
define a function, and define a static initializer that allocates
c10::Library (previously called c10::Module, but we renamed it
to avoid confusion with the existing NN module concept), passes
it to your function, and then retains it for the rest of the lifetime
of the program.  Specification of the namespace is mandatory,
and in later commit I plan to make it a hard error to TORCH_LIBRARY
the same library name twice.

If you are specifying an implementation for an existing operator
(e.g., you're the XLA backend, or even if you're just putting
registrations for implementations at the implementation site),
you should use TORCH_LIBRARY_IMPL, which instead takes a backend
argument (instead of namespace) and can be used to specify an
implementation for a backend.  Unlike TORCH_LIBRARY, you can do
as many of these as you want for a backend.

This needs updates to the mobile code analyzer.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929257

Pulled By: ezyang

fbshipit-source-id: ba04d78492e8c93ae7190165fb936f6872896ada
2020-04-16 10:44:21 -07:00
3c85f44ce8 Fail setup.py if trying to set up with Python 2 (#35613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35613

Python 2 has reached end-of-life and is no longer supported by PyTorch.
To spare users from a long, doomed setup when trying to use PyTorch with
Python 2, detect this case early and fail with a clear message.  This
commit covers setup.py.

Test Plan: Attempted to build PyTorch with Python 2 and saw a clear error *quickly*.

Differential Revision: D20842881

Pulled By: dreiss

fbshipit-source-id: caaaa0dbff83145ff668bd25df6d7d4b3ce12e47
2020-04-16 10:24:03 -07:00
83de675ebf Fail CMake setup if trying to build with Python 2 (#35612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35612

Python 2 has reached end-of-life and is no longer supported by PyTorch.
To spare users from a long, doomed build when trying to use PyTorch with
Python 2, detect this case early and fail with a clear message.  This
commit covers CMake setup.

Test Plan: Attempted to build PyTorch with Python 2 and saw a clear error *quickly*.

Differential Revision: D20842873

Pulled By: dreiss

fbshipit-source-id: b35e38c12f9381ff4ca10cf801b7a03da87b1d19
2020-04-16 10:22:36 -07:00
ac950bb9c8 Update docs for master to remove Python 2 references (#36336)
Summary:
Fix compile error from original PR in jit_language_references.rst: https://github.com/pytorch/pytorch/pull/36114

Full details in task: https://our.intern.facebook.com/intern/tasks/?t=64776265

With PyTorch 1.5+ we removed Python 2 support. All documentation under docs/ and on the pytorch.org website needs to remove Python 2 references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36336

Differential Revision: D21057507

Pulled By: jlin27

fbshipit-source-id: 993a763f1ecb16dad859bc02a07625ddc023645d
2020-04-16 10:15:48 -07:00
f5c230b892 Make futures vector a local function var (#36677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36677

Move the `futures` vector to be a local function var like `errorFutures`. Holding the lock to clear the vector is now unnecessary.
ghstack-source-id: 102265569

Differential Revision: D20884589

fbshipit-source-id: c9a13258bee737d86f9b0d11cdd28263bb923697
2020-04-16 10:09:39 -07:00
f11c4f90c2 New CUDA Fuser: Unrolling support, interface refactor (#36435)
Summary:
Unrolling support has been added in a way that we get good performing code on GPUs. Not sure how long this link will last but an example of a generated unrolled kernel is:
https://godbolt.org/z/i0uAv3

What can be seen from there is multiple calls of "ld.global.f32" without "st.global.f32" in between them (and vice versa). This means that we are launching multiple loads that can run in parallel, as well as multiple stores that can run in parallel. This can be a crucial optimization for memory-bound kernels. This was generally a point of concern in TVM, as an attempt at a similar kernel from TVM produces https://godbolt.org/z/Vu97vG, which surrounds load-store pairs in conditional branches, preventing the benefits of unrolling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36435

Reviewed By: ZolotukhinM

Differential Revision: D21024011

Pulled By: soumith

fbshipit-source-id: e852e282fa7a304aba962e1926f756098c011fe0
2020-04-16 09:20:24 -07:00
d7fabfd5df Implements complex isfinite and isinf (#36648)
Summary:
Implements complex isfinite and isinf, consistent with NumPy.

A complex value is finite if and only if both its real and imaginary part are finite.

A complex value is infinite if and only if its real or imaginary part are infinite.

Old isfinite, isinf, and isnan tests are modernized and instead of fixtures the torch results are compared with NumPy. A new test is added for complex isfinite, isinf, and isnan. The docs for each function are updated to clarify what finite, infinite, and NaN values are.

The new tests rely on a new helper, _np_compare, that we'll likely want to generalize in the near future and use in more tests.

Addresses part of the complex support tasks. See https://github.com/pytorch/pytorch/issues/33152.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36648
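
A minimal sketch of the new semantics:

```python
import torch

z = torch.tensor([complex(1.0, float('inf')),
                  complex(float('nan'), 2.0),
                  complex(3.0, 4.0)])
torch.isfinite(z)  # tensor([False, False,  True])
torch.isinf(z)     # tensor([ True, False, False])
torch.isnan(z)     # tensor([False,  True, False])
```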

Differential Revision: D21054766

Pulled By: mruberry

fbshipit-source-id: d947707c5437385775c82f4e6c722349ca5a2174
2020-04-16 09:09:02 -07:00
d0c925f1c7 Returns float tensors for complex inputs to abs (#35871)
Summary:
Per title. A test is added to test_type_promotion for the behavior. This behavior is consistent with NumPy's.

For complex inputs to `abs` the result is cast to float after the computation since the computation of abs must be performed on the original complex tensor. While `std::abs` returns a float value when called on complex inputs, returning a FloatTensor directly would require additional loop instantiations in TensorIterator. This may be worthwhile to pursue in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35871
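
A minimal sketch of the behavior:

```python
import torch

z = torch.tensor([3 + 4j, 0 + 1j])
out = torch.abs(z)
print(out, out.dtype)  # tensor([5., 1.]) torch.float32 -- a float tensor, as in NumPy
```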

Differential Revision: D20984456

Pulled By: mruberry

fbshipit-source-id: 226445178f92f2b0292e92578656d98674a6aa20
2020-04-16 09:03:17 -07:00
bede7d9995 Fixed check for the buffer overflow in assert (#36476)
Summary:
This code looks like a mistake
```C++
AT_ASSERT(size_t(kind) < sizeof(names) / sizeof(AttributeKind));
```
It does not check whether the `kind` variable fits in the array of pointers called `names`

Even if we write something like this, that assert won't fail:
```C++
AttributeKind kind = AttributeKind::ival;
*((unsigned int*)&kind) += 1;  // pushes kind past the last valid enum value
```
So I fixed it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36476

Differential Revision: D21018748

Pulled By: colesbury

fbshipit-source-id: f4d3b8faf64cf07232d595075f831805084f5d00
2020-04-16 08:58:03 -07:00
9e016f77a8 Added complex types to get_all_dtypes and turned on masked_fill for complex (#36335)
Summary:
1. Added complex dtypes to get_all_dtypes to unify testing of complex dtypes with other dtypes, so that complex support doesn't get out of sync with the behavior supported for other dtypes.
2. resolves https://github.com/pytorch/pytorch/issues/36322, https://github.com/pytorch/pytorch/issues/36327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36335

Differential Revision: D21045603

Pulled By: anjali411

fbshipit-source-id: 5089306b66fdc18148e831f56298da5de673be67
2020-04-16 08:24:45 -07:00
049dede3be Move rpc.rst back to the source folder to preserve existing doc URLs (#36675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36675

Test Plan: Imported from OSS

Differential Revision: D21048628

Pulled By: mrshenli

fbshipit-source-id: 3cb1b35ddc1f40c673b0db9048d77dfa024be1e7
2020-04-16 08:12:34 -07:00
30fabd9398 Creates "Versioned Symbol" pattern to preserve serialized Torchscript semantics (#36300)
Summary:
PyTorch users write programs and save them as serialized Torchscript. When this Torchscript is loaded it contains symbols like "aten::div" describing some of the program's behavior. If the behavior of these symbols has changed since the program was serialized, however, then the original program's semantics may not be preserved.

For example, when we make aten::div always perform "true" division, like NumPy, Python3, and JAX, then serialized Torchscript programs relying on aten::div performing floor division on integral inputs will break.

This PR demonstrates the "Versioned Symbol" pattern that lets symbols be remapped into Torchscript builtins that preserve their historic behavior. Using this pattern, after we update aten::div to always perform true division, we could remap it in older Torchscript programs to a builtin that implements its historic behavior.

The pattern is described in the [Versioned Symbols] note in the code and is implemented like this:

- If BuiltinModule is given a version, before it returns a symbol it queries to see if another symbol should be substituted for it.
- versioned_symbol.cpp has a map for symbols and the version range for which another symbol should be substituted for them.
- The substitutions are implemented as builtin functions.

An example using the new, test-only _subcmul function is implemented to test this behavior. A test in jit/test_save_load.py follows the pattern described in the [Versioned Symbols] note and uses a fixture serialized with file version 2 to verify that the historic behavior is preserved.

In the future we will likely need a slightly more complex mechanism with multiple substitutions with distinct version ranges, and this just requires changing the map to be Symbol->Iterable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36300

Differential Revision: D21058990

Pulled By: mruberry

fbshipit-source-id: 2b7c732878c0ecfcd9f0a6205fb6d6421feeaf61
2020-04-16 04:56:53 -07:00
0785585db9 Reland Make DispatchKeyExtractor forget about TensorOptions (#36290) (#36562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36562

The BackendSelect dispatch key gives us a way to extract backend-
specific dispatch keys from non-Tensor arguments without teaching
the DispatchKeyExtractor about them. Here we finish switching over
to the BackendSelect approach for factory functions and remove
TensorOptions from the set of types DispatchKeyExtractor needs to
consider.

Test Plan: Imported from OSS

Differential Revision: D21013652

Pulled By: bhosmer

fbshipit-source-id: e30512d1c3202149e72b7d7ce15084bbfed63ac7
2020-04-16 01:17:32 -07:00
f89fc204c6 [caffe] fix input order in SLS op documentation (#36708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36708

WEIGHTS is the second input operand of SparseLengthsWeightedSum operators, but in the documentation the order was wrong.

Test Plan: CI

Reviewed By: yinghai

Differential Revision: D21058240

fbshipit-source-id: e160e983603e606e63fbbfdee34d98d3587870d8
2020-04-16 00:55:54 -07:00
7539ea0207 [TensorExpr] Add simplification of length 0 and 1 For loops to IR Simplifier (#36348)
Summary:
Simplifies loops which can be collapsed down into a single block or removed entirely. E.g.

```
For 0..1 {
  Statements...
}
```

Is now just `Block({Statements...})`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36348

Differential Revision: D21057959

Pulled By: nickgg

fbshipit-source-id: 2f95a19a965c4a6e023680e2cea9ea846e82d62e
2020-04-15 23:56:34 -07:00
e17cf93b9a Report tensor_expr test results (#36684)
Summary:
Also download the MNIST dataset in the background while jit tests are running
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36684

Differential Revision: D21059045

Pulled By: malfet

fbshipit-source-id: 9904be303763e9891f1818c1334f328bc0d0e4a7
2020-04-15 22:16:46 -07:00
f548946363 Fix out-of-boundary access in caffe2::StartsWith (#36672)
Summary:
`std::mismatch( InputIt1 first1, InputIt1 last1, InputIt2 first2 )` assumes that the container behind the `first2` iterator contains at least `last1 - first1` elements, which is not the case if `prefix` is longer than `str`.
Found while running unit tests on Windows
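The guarded form looks roughly like this (a sketch of the safe pattern, not necessarily the exact patch): check the lengths before letting `std::mismatch` walk the prefix.
```C++
#include <algorithm>
#include <string>

bool StartsWith(const std::string& str, const std::string& prefix) {
  // Comparing sizes first guarantees std::mismatch never reads past the
  // end of str when prefix is the longer of the two strings.
  return str.size() >= prefix.size() &&
      std::mismatch(prefix.begin(), prefix.end(), str.begin()).first ==
          prefix.end();
}
```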
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36672

Differential Revision: D21049407

Pulled By: malfet

fbshipit-source-id: ad45779d47a0c6898900e0247c920829a2179f62
2020-04-15 20:40:59 -07:00
30dd0b74fd Save view_fn for inplace update on view tensors (#36073)
Summary:
This PR enables in-place updates on view Tensors for tensor types (e.g. XLA) that don't support as_strided.
(See Notes inside PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36073

Reviewed By: yf225

Differential Revision: D20994282

Pulled By: ailzhang

fbshipit-source-id: 83eeccb297b242f9822f08ad110a7045d7055639
2020-04-15 20:11:27 -07:00
f64fae9193 Fix race in mark_graph_task_completed. (#36640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36640

We had the following race when two threads entered
'mark_graph_task_completed'.

1) Thread 1 grabs the graph_task mutex first and moves captured_vars_ to its
local 'vars'.
2) Thread 1 releases the lock.
3) Thread 2 grabs the mutex and moves an empty captured_vars_ to its local
'vars'.
4) Thread 2 now proceeds to call 'markCompleted' with empty grads.
5) Thread 1 which actually has the right grads never gets to set the grads on
the future since future_completed_ is set to True by Thread 2.

Discovered this while running our RNN example:
https://github.com/pytorch/examples/tree/master/distributed/rpc/rnn and
verified this PR fixes the race.
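A sketch of the pattern that prevents this, with illustrative stand-in types (the real code operates on the autograd GraphTask and its Future): only the thread that atomically wins completion may move the captured grads.

```C++
#include <atomic>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative stand-in for the relevant GraphTask members.
struct GraphTask {
  std::mutex mutex_;
  std::atomic<bool> future_completed_{false};
  std::vector<float> captured_vars_;  // stands in for the captured gradients
};

void mark_graph_task_completed(GraphTask& task) {
  // Whichever thread wins this exchange owns completion; a second caller
  // returns immediately instead of publishing empty grads.
  bool expected = false;
  if (!task.future_completed_.compare_exchange_strong(expected, true)) {
    return;
  }
  std::vector<float> vars;
  {
    std::lock_guard<std::mutex> guard(task.mutex_);
    vars = std::move(task.captured_vars_);
  }
  // ... futureResult_->markCompleted(std::move(vars)) would follow here ...
  (void)vars;
}
```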
ghstack-source-id: 102237850

Test Plan: waitforbuildbot

Differential Revision: D21035196

fbshipit-source-id: 1963826194d466b93f19e8016b38e4f9cad47720
2020-04-15 20:05:34 -07:00
a5d0d762fa redo of add quantized layer norm implementation (#36593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36593

This is a redo of https://github.com/pytorch/pytorch/pull/35329 with a
better test.

Adds a quantized implementation of LayerNorm for server.

A future PR will add the Python wrapper.

Test Plan:
numerics match the floating point implementation

benchmarks by input size:
v1 (mean+var non-vectorized): https://gist.github.com/vkuzo/f6d72c04742608112f4c2e612c74bd13
v2 (mean+var vectorized in float): https://gist.github.com/vkuzo/4dd95657c5b5f3654e0965db00eff8d2
v3 (mean+var vectorized in int, current): https://gist.github.com/vkuzo/57a75f75629da9f23b64b38ca0e3d34b

Differential Revision: D21030268

Pulled By: vkuzo

fbshipit-source-id: b3594c3393cfce37a881319e2e0560620d51080f
2020-04-15 19:47:18 -07:00
91f1d79d1b hardswish: enable for QAT (#36604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36604

Adds the logic to wrap the HardSwish module in FakeQuant
to support QAT.

Test Plan:
Added test to cover that this happens properly.

Imported from OSS

Differential Revision: D21045322

fbshipit-source-id: 8c46559ade58a5d5c56442285842627a3143eb0f
2020-04-15 18:04:11 -07:00
65df8b3886 hardswish: make it work in static quantization (#36545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36545

* adds a quantized nn.module for Hardswish so we can observe activation values
* modifies the hardswish op to allow specifying scale + zero_point
* makes hardswish model be properly swapped in static quantization

Test Plan:
added tests and they pass for:
* the new _out flavor of hardswish
* QNNPACK changes
* static quant e2e

Imported from OSS

Differential Revision: D21045320

fbshipit-source-id: ab7e52f0f54a7d5923ab6f58197022cc28c12354
2020-04-15 18:02:35 -07:00
9cbeb0faed [JIT] Dont optimize shape peepholes on inline (#36404)
Summary:
With https://github.com/pytorch/pytorch/pull/35562, we are running peephole optimization on inlining to reduce the number of nodes that are copied.

The tracer encodes the sizes in the graph like:
```
graph(%0 : Double(7)):
  %1 : Function = prim::Constant[name="tensor_size"]()
  %2 : Tensor = prim::CallFunction(%1, %0)
  return (%2)
```

However, people would like to reuse the graph with different shapes, so running the size-folding optimizations would invalidate that. Long term it might be better for the tracer not to include shape information, but there are downstream users of that.

Separates out FuseAddMM from the peephole pass so that there is now a single `disable_size_optimizations` parameter, and ONNX explicitly invokes FuseAddMM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36404

Differential Revision: D20968974

Pulled By: eellison

fbshipit-source-id: 56f8f1699e3b0adeeccdfd5a67bb975fd41a2913
2020-04-15 17:49:48 -07:00
a99b169828 [TensorExpr] fix a bug in LLVM codegen around empty kernels (#36660)
Summary:
LLVM Codegen assumes that the kernel contains real statements, but that is not guaranteed, especially after IR Simplification. This PR adds a catch for the case where no value is generated after recursing the LLVMCodegen visitor through the kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36660

Differential Revision: D21044066

Pulled By: nickgg

fbshipit-source-id: e521c766286b1ff4e26befcec7ff4959db8181a4
2020-04-15 17:45:06 -07:00
8d66f88eb1 [jit] Fix bound method copying (#36546)
Summary:
Previously we were copying the bound method of the original class to the
new script module class, which caused `self` to be wrong. This PR
changes it so we fetch the unbound function, bind it to the new
script module, and then attach it to the module.

Fixes #28280
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36546

Pulled By: driazati

Differential Revision: D21023329

fbshipit-source-id: 6b3f8404700860151792f669a9c02fbd13365272
2020-04-15 17:38:20 -07:00
5927a6731c [PyTorch Docs] Updated RRef docs to indicate RPC Retries (#36678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36678

Updated the docs to explicitly indicate that RRef control messages are
idempotent and retried upon failure.
ghstack-source-id: 102225791

Test Plan: build bot

Differential Revision: D20828041

fbshipit-source-id: ca4d71c65a453664c16c32134c47637a966b1a19
2020-04-15 17:33:20 -07:00
6bd6b70a02 Fix clang-format (#36685)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36685

Differential Revision: D21052657

Pulled By: malfet

fbshipit-source-id: b4ec7eba21864108a1108f8c83b5d33cf31ab89e
2020-04-15 17:02:20 -07:00
609b6875f9 Enable test_upsamplingNearest2d_launch_fail on ROCm (#36624)
Summary:
The test case exercised in `test_upsamplingNearest2d_launch_fail` will fail on ROCm. The max grid size per dimension for ROCm is 4294967295 (0xffffffff), which is why the tensor dims in `test_upsamplingNearest2d_launch_fail` must give correct results there.
This PR adds the test case `test_upsamplingNearest2d_launch_rocm` for the ROCm-only scenario, which is essentially the same as `test_upsamplingNearest2d_launch_fail` but without the expected-failure decorator

ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36624

Differential Revision: D21050330

Pulled By: ezyang

fbshipit-source-id: d7370c97eaab98f382f97052ed39cc168a3bfa71
2020-04-15 16:29:53 -07:00
2cf53128a8 Switch xla job to use bionic clang9 image (#36618)
Summary:
XLA needs to switch to clang9 to build with the latest TF dependency.
We keep the pytorch/pytorch build on gcc for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36618

Differential Revision: D21045723

Pulled By: ailzhang

fbshipit-source-id: 015b65dad2aeef31fd66b753d519b2c9b9ed8b7f
2020-04-15 16:00:42 -07:00
ddd9eb3e12 Make special cases prim ops instead (#36635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36635

Those ops were manually written in register_aten_ops.cpp, which had a few issues; for example, it caused them to be duplicated across all register_aten_ops_X.cpp files and to exist multiple times.

Instead, these should just be regular prim ops.
ghstack-source-id: 102204991

Test Plan: waitforsandcastle

Differential Revision: D21032778

fbshipit-source-id: 18f5eef1cad842d89c97610fc77b957608d2b15e
2020-04-15 15:54:31 -07:00
a3314f1902 [jit] Add return statement back to Future::addCallback() (#36662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36662

This was a mistake from an earlier change, though the expected impact is
relatively minimal - mostly keeping callbacks around longer than necessary
in the case of callbacks on already-completed futures.
ghstack-source-id: 102203224

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D21044145

fbshipit-source-id: f3bd58bd6bde83caaa7b9bd0385d0ce3647dbc05
2020-04-15 13:09:40 -07:00
dad25ae47d Add the one-block multi-thread global reduction support. (#36306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36306

Missing __syncthreads between sections.

Differential Revision: D20957254

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Pulled By: zheng-xq

fbshipit-source-id: c988f0205b667174b3ee851c28adeec2dbd089f7
2020-04-15 13:05:11 -07:00
e80813fae3 Add trivial reduce for Cuda (#36293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36293

Detect non-read-only loads and do not use __ldg for them.
Resubmitting #36092

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D20935933

Pulled By: zheng-xq

fbshipit-source-id: f9280db26aa9c9c8119cea12571bc820f5fbcb61
2020-04-15 13:03:58 -07:00
efab75730f Migrate release CI jobs to CircleCI for Windows (#36657)
Summary:
It should work for both tagged builds and nightly builds now.
Corresponding test pr: https://github.com/pytorch/pytorch/pull/36580
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36657

Differential Revision: D21047686

Pulled By: seemethere

fbshipit-source-id: ad7065fc30f9b0d353bff52d4a9f35c8470daf63
2020-04-15 12:50:26 -07:00
5afd816793 Add a warning for Single-Process Multi-GPU DDP (#36656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36656

Test Plan: Imported from OSS

Differential Revision: D21042537

Pulled By: mrshenli

fbshipit-source-id: fa3501dc2bba14550ec4f254612a80f61fe86a4a
2020-04-15 12:43:50 -07:00
df9a250b8d [pt][quant] avgpool3d for graph mode (#36598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36598

avgpool3d op for graph mode quantization
ghstack-source-id: 102204586

Test Plan: buck test //caffe2/test:quantization -- 'TestQuantizeScriptPTSQOps'  --print-passing-details 2>&1 | tee b.log

Differential Revision: D21023035

fbshipit-source-id: cb5e627763513a19dba099a79cad750914b14ec6
2020-04-15 12:39:32 -07:00
ba3d4019e9 Remove prim::CudaFusionGroup from register_prim_ops_fulljit.cpp: it is registered in jit/codegen/cuda/interface.cpp. (#36661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36661

Test Plan: Imported from OSS

Differential Revision: D21044112

Pulled By: ZolotukhinM

fbshipit-source-id: aba280c0ddae1350239c0656ae37203dbd620534
2020-04-15 12:35:52 -07:00
62e884f8d9 Report bazel-test results as CircleCI metadata (#36643)
Summary:
Also print docker container stats at the end of the run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36643

Differential Revision: D21044161

Pulled By: malfet

fbshipit-source-id: 6877d8ce4789116ef270124307844f6cef7dcef5
2020-04-15 11:27:48 -07:00
f98e0a099a [pytorch] handle pybind11 style registration API with code analyzer (#36607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36607

PR #36258 and subsequent PRs in the stack switch c10 registrations to
the new pybind11 style registration API. One notable difference from the old
c10 registration API is that the operator's namespace is no longer in the op
schema string, e.g. "aten::" will be factored out from "aten::conv",
"aten::empty", etc. The namespace string will be declared at the
beginning of registrations with the TORCH_LIBRARY / TORCH_LIBRARY_IMPL
macro.

A rather simple fix is to extract namespace string from the name of
enclosing function of registrations, as the TORCH_LIBRARY macro will
always create an init function (per namespace) by appending namespace
string to a common prefix.

Another side effect of the API change is that it adds some debug string
constants to the registration API, and because of factoring out the
namespace part from the op name, there is no longer an effective way to
differentiate between the real op name and debug strings. A simple
workaround is that we only keep the first string constant encountered
while BFSing the LLVM IR - the real op name is directly passed into the
registration call while the debug string is indirectly passed via
CppFunction.

These new assumptions might be broken by future changes, but this is simple
enough to implement to unblock the API work.

Test Plan: Imported from OSS

Differential Revision: D21026008

Pulled By: ljk53

fbshipit-source-id: c8c171d23aaba6d6b7985d342e8797525126a713
2020-04-15 11:03:41 -07:00
527cf877d6 Delete old mkl_speed_test.py
Summary: It was always skipped for the last 1.5 years (since D10372230 was landed)

Test Plan: CI

Reviewed By: ailzhang

Differential Revision: D21036194

fbshipit-source-id: 9ace60b45a123a9372a88310b91f33a69ae8880c
2020-04-15 11:02:01 -07:00
4a49ad0da7 Fixed error Regex Parsing for Node Failure Tests (#36620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36620

Sending to a node that has been shutdown in ProcessGroupAgent could throw several possible exceptions. This PR updates the tests to check for the right exceptions while waiting for other nodes in the gang to fail in `test_backward_node_failure` and `test_backward_node_failure_python_udf`.
ghstack-source-id: 102153944

Test Plan: Stress-tested `test_backward_node_failure` and `test_backward_node_failure_python_udf`. They were previously completely broken; this change makes `test_backward_node_failure` functional, while `test_backward_node_failure_python_udf` remains flaky but fails infrequently. A change to make the last test work reliably is planned.

Differential Revision: D21027280

fbshipit-source-id: e85c2d219ee408483442bd9925fff7206c8efe4b
2020-04-15 10:54:59 -07:00
87be115fd0 Error Handling in RPC Agent (#35263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35263

Process Group Agent throws an exception if a send attempt is made after the agent is shut down. With retries, we should catch this exception and mark the original future with an error.
ghstack-source-id: 102153897

Test Plan: Running all rpc/dist_autograd tests.

Differential Revision: D20611412

fbshipit-source-id: a6009f0b0aa8be662364158962a054c5c29090bf
2020-04-15 10:53:31 -07:00
1e7155caa5 Bucketization (#7284) (#34577)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34577

Test Plan: Imported from OSS

Differential Revision: D20380975

Pulled By: glaringlee

fbshipit-source-id: d75939bc54d98675f88d7037491a8420ac20847a
2020-04-15 10:32:51 -07:00
3c8921b747 hardswish: add backards pass test (#36420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36420

Adds a unit test for hardswish backward pass

Test Plan:
Unit test passes on cpu and cuda

Imported from OSS

Differential Revision: D20994100

fbshipit-source-id: 579df709cc2d92fce3b9a0eeb6faeb9fe8d2f641
2020-04-15 10:17:13 -07:00
16e90eba59 hardsigmoid: add cuda kernels (#36351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36351

Adds CUDA kernels for hardsigmoid, to enable its use in training.

Note: the update to the cpu backward pass is to keep the cpu vs cuda
logic consistent, no change in functionality.

Test Plan:
add CI for the forward pass
run this for the backward pass:
https://gist.github.com/vkuzo/95957d365600f9ad10d25bd20f58cc1a

Imported from OSS

Differential Revision: D20955589

fbshipit-source-id: dc198aa6a58e1a7996e1831f1e479c398ffcbc90
2020-04-15 10:15:49 -07:00
cdfefa77a3 PR for double backwards of nn.Fold and nn.Unfold (issue #33452) (#36379)
Summary:
soumith ezyang albanD  After lots of experiments, I didn't manage to directly print the gradients of Fold/Unfold_backward (let me know if I am wrong).
Thus, in my test code, I compare the gradients of Fold/Unfold_backward implicitly by comparing the gradients of the operation that follows it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36379

Differential Revision: D21040646

Pulled By: ezyang

fbshipit-source-id: dafdbfe2c7b20efa535402c7f81fce5c681fce2f
2020-04-15 10:10:05 -07:00
9cac2b83d9 [pytorch] improve code analyzer to dump ops called from c++ functions (#35941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35941

The key step of mobile custom build is to find out ops used by specific
model, with which it can produce a tailored build of optimal size.

However, ops can not only be called from a TorchScript model but can also
be called from C++ code directly, e.g. via torch::jit:: APIs. With
static dispatch, ops called this way will be statically linked into client
code. With dynamic dispatch, we need to obtain & keep these ops explicitly.

This PR improves static code analyzer to dump ops that are called from
visible c++ symbols matching specific regex. This provides a mechanism
to solve the custom build problem with dynamic dispatch.

It starts with dumping ops that are callable from functions in the torch::jit
namespace and including them in custom builds with dynamic dispatch. We can
extend it to analyze custom code / to refine the set of JIT APIs that
are relevant, etc. This is just a preliminary version. We need to
improve its usability for more general-purpose use.

Test Plan: Imported from OSS

Differential Revision: D20835166

Pulled By: ljk53

fbshipit-source-id: a87cfb22b34f89545edd0674a5dfca6b7cff2b0c
2020-04-14 23:21:19 -07:00
f99a28f515 [ONNX] Adding a pass to replace interpolate function with aten::__interpolate (#35744)
Summary:
Since aten::__interpolate was removed in https://github.com/pytorch/pytorch/pull/34514, we need a pass to replace the interpolate function with aten::__interpolate for ONNX export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35744

Reviewed By: hl475

Differential Revision: D20907041

Pulled By: houseroad

fbshipit-source-id: f2d2cdfec47389245c50f538267124eedf682adf
2020-04-14 23:16:22 -07:00
cf27d07e04 Implementation of STORM optimizer caffe2 python wrapper (#36399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36399

Added caffe2 python wrapper and unit test for the STORM C++ operator.

Test Plan:
All newly added unit tests passed using "buck test //caffe2/caffe2/python:optimizer_test -- TestStorm"

{F233644598}

Reviewed By: chocjy

Differential Revision: D18841013

fbshipit-source-id: f692bc18412839db140202ec9a971e556db0e54f
2020-04-14 23:05:45 -07:00
f7c9faab05 Implementation and operator test for STORM optimizer (#36225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36225

Implemented the [STORM](https://arxiv.org/abs/1905.10018) optimizer operator for dense and sparse cases.

Test Plan:
All newly added unit tests passed using "buck test //caffe2/caffe2/python/operator_test:storm_test".

{F233643713}

Reviewed By: chocjy

Differential Revision: D18702897

fbshipit-source-id: d25eeb492aa2a03c69754d3f076a8239230b3bf4
2020-04-14 23:04:26 -07:00
84f4061a67 Back out "Revert D20147487: Refactor jit::Operator to more clearly distinguish the two possible states" (#36634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36634

Original commit changeset: e69432dc3d03
ghstack-source-id: 102163937

Test Plan: waitforsandcastle

Differential Revision: D21023952

fbshipit-source-id: d1bad395cb0b4eda91a5d815291ac9b7bdb04573
2020-04-14 22:16:17 -07:00
70d3616aa1 [PyTorch] Split libtorch_sources into smaller filelists (#36583)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36583

To make them more reusable across different build systems.
Move the `load()` directive at the head of `build_variables.bzl` inside the function that uses it, to make `build_variables.bzl` a valid standalone Python source file.

Test Plan: CI + `python -c 'exec(open("tools/build_variables.bzl").read());print(libtorch_sources)'`

Reviewed By: EscapeZero

Differential Revision: D21018974

fbshipit-source-id: 3dbf2551620f164b8910270ad2c5c91125a9f5f0
2020-04-14 21:51:33 -07:00
91e59f5fe2 [PyTorch] Remove build definitions from build_variables.bzl (#36602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36602

`build_variables.bzl` should contain only filelists to make it interpretable between BUCK, Cmake and Bazel build systems.

Test Plan: CI

Reviewed By: dzhulgakov

Differential Revision: D21022886

fbshipit-source-id: 9dd1e289ac502bc325e1223197b6156a316498ba
2020-04-14 21:50:01 -07:00
80b01ba4f3 [TensorBoard] fix #34954 (#36496)
Summary:
cc orionr sanekmelnikov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36496

Differential Revision: D21012775

Pulled By: natalialunova

fbshipit-source-id: 2dc978d70d457b511bd13a3399246ae0349ff8ca
2020-04-14 19:23:47 -07:00
ceecca3324 Clang-format: whitelist test/cpp/tensorexpr/*. (#36616)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36616

Test Plan: Imported from OSS

Differential Revision: D21027732

Pulled By: ZolotukhinM

fbshipit-source-id: f6504ae9c9c0872cee7f0ffcff3ad0e1e229b482
2020-04-14 19:09:37 -07:00
317f598103 [TensorExpr] Clang-format test/cpp/tensorexpr/*. (#36615)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36615

Test Plan: Imported from OSS

Differential Revision: D21027733

Pulled By: ZolotukhinM

fbshipit-source-id: e19cd85c1634f4e40805814ac71eec719d6587f8
2020-04-14 19:08:18 -07:00
37aab14d14 [future] Avoid some future callback self-captures. (#36502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502

We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.

This change migrates the future lambda cases where there was self-capture.
 - In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback.
   This avoids the reference cycle. We use this primarily in the case where the
   value ends up being moved in the callback (something we want to be careful about)

 - We also add a convenience api to Future where the completed Future is returned as an arg.
   This allows us to avoid self-capture, though it assumes that the markCompleted()
   caller is persisting the future for the markCompleted() duration (this has been the case)
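A minimal sketch of the weak_ptr variant described in the first bullet (placeholder Future type; the real class and callback signature live in the JIT sources):

```C++
#include <cassert>
#include <functional>
#include <memory>

struct Future {
  int value = 0;
  std::function<void()> callback;
  void addCallback(std::function<void()> cb) { callback = std::move(cb); }
};

void installCallback(const std::shared_ptr<Future>& future) {
  // Capturing a weak_ptr instead of the shared_ptr itself breaks the
  // future -> callback -> future reference cycle described above.
  std::weak_ptr<Future> weak = future;
  future->addCallback([weak]() {
    auto fut = weak.lock();
    assert(fut);  // the markCompleted() caller keeps the future alive
    // ... move fut->value out here ...
  });
}
```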

ghstack-source-id: 102130672

Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20998905

fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
2020-04-14 17:52:44 -07:00
1a0b95e7e4 bfloat16: enable basic math function (#35172)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35172

Test Plan: Imported from OSS

Differential Revision: D20721146

Pulled By: ngimel

fbshipit-source-id: 25b2176d0a431706c51a7086e0642aff814d7148
2020-04-14 17:18:21 -07:00
73f11a0b23 Update qbatch_norm2d opbenchmark test (#36630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36630

Test Plan:
OMP_NUM_THREADS=1 python -m pt.qbatchnorm_test

Imported from OSS

Differential Revision: D21030508

fbshipit-source-id: 1ece1bd7429207732eae4dd1982ceddcdc5d3a91
2020-04-14 17:09:18 -07:00
67e0bf14b7 Add support of Dict as output when connecting script and tracing (#36265)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36265

Reviewed By: hl475

Differential Revision: D20927160

Pulled By: houseroad

fbshipit-source-id: 5a63022e92d234b97b57d60ef7f7aa3bc41c2d22
2020-04-14 16:06:53 -07:00
ce3555a635 Relanding masked_select cuda port from TH to ATen (#36539)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33054
Relanding PR https://github.com/pytorch/pytorch/issues/35429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36539

Differential Revision: D21007226

Pulled By: ngimel

fbshipit-source-id: 3c66ad073ff8e767ad120bc94120379d40346018
2020-04-14 14:03:59 -07:00
9216c67c9e Revert D21021677: [pytorch][PR] Add core of c10::complex
Test Plan: revert-hammer

Differential Revision:
D21021677

Original commit changeset: 9e144e581fa4

fbshipit-source-id: ce6a88fc71ec0134d0fc6ecdddc4c4db35f89b1f
2020-04-14 13:58:24 -07:00
5150334c1d Unconditionally register schema even for manual registration. (#36250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36250

The general concept is that I want a centralized location where you
can find all of the registrations for a library.  I cannot do this
if I don't codegen all of the schemas in one spot--right now,
most schemas get generated, but not manually registered ones. Let us
assume that manual registration has to do with the actual
implementations; nothing strange is going on with the schema
definition itself.  Make it so.
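Concretely, the split this assumes looks like the following (hypothetical namespace and op name; TORCH_LIBRARY / TORCH_LIBRARY_IMPL are the real macros): the schema definition is always emitted in the central generated block, while the implementation may still be registered manually elsewhere.

```C++
#include <ATen/ATen.h>
#include <torch/library.h>

// Central, generated registration: the schema is always defined here.
TORCH_LIBRARY(myns, m) {
  m.def("myop(Tensor self) -> Tensor");
}

// "Manual" registration elsewhere only supplies the implementation.
at::Tensor myop_cpu(const at::Tensor& self) {
  return self.clone();
}

TORCH_LIBRARY_IMPL(myns, CPU, m) {
  m.impl("myop", myop_cpu);
}
```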

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929258

Pulled By: ezyang

fbshipit-source-id: 0a9fdc8eccd7b688b3e7bd8ed64b6e2af21978f4
2020-04-14 13:34:07 -07:00
6c742af235 Remove attributes and method of submodules in frozen module (#34787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34787

This is a follow-up patch to freezing of TorchScript modules. This patch
enables removal of constant attributes and unused methods in submodules.
The cleanup logic is generalized to handle attributes that share their class
type.

Test Plan: Imported from OSS

Differential Revision: D21004990

Pulled By: bzinodev

fbshipit-source-id: 84778aa9ae1a96d23db29c051031f9995ed3ac90
2020-04-14 12:07:12 -07:00
01b121bd14 Fix bc test (#36588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36588

-
ghstack-source-id: 102122179

(Note: this ignores all push blocking failures!)

Test Plan: -

Differential Revision: D21021653

fbshipit-source-id: b3693a1a5e27d28dc2fc772cbef5787ab4ceafaa
2020-04-14 11:43:06 -07:00
7390c333d6 [CI] fix test_distributed for python 3.8+ (#36542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36542

Python 3.8 set the default multiprocessing start mode to spawn, but we
need fork in these tests; otherwise there are some pickling issues.
Test: Ensure that these tests succeed when run with python 3.8
ghstack-source-id: 102093824

Test Plan: Ensure success with python 3.8

Differential Revision: D21007753

fbshipit-source-id: 4b39844c6ba76a53293c0dfde7c98ec5a78fe113
2020-04-14 11:38:33 -07:00
25252816cf Add core of c10::complex (#35524)
Summary:
Step 0 of https://github.com/pytorch/pytorch/issues/35284

Reference: https://en.cppreference.com/w/cpp/numeric/complex
We are targeting C++20. The differences across C++ versions are mostly `constexpr` qualifiers; newer versions have more functions declared as `constexpr`.

This PR adds the core of `c10::complex`, it includes
- standard constructors as in `std::complex`
- explicit conversion constructors converting from `std/thrust::complex` to `c10::complex`
- standard assignment operators as in `std::complex`
- conversion assignment operators converting from `std/thrust::complex` to `c10::complex`
- other standard operators as in `std::complex`
- standard methods as in `std::complex`
- explicit casting operators to std/thrust
- basic non-member functions as in `std::complex`:
  - arithmetic operators
  - `==`, `!=`
  - `<<`, `>>`
  - `std::real`, `std::imag`, `std::abs`, `std::arg`, `std::norm`, `std::conj`, `std::proj`, `std::polar`
    - Some of them are intentionally not completely implemented, these are marked as `TODO` and will be implemented in the future.

This PR does not include:
- overload of math functions

which will come in the next PR
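A rough usage sketch of the intended surface (the header path is assumed, and this is illustrative, not code from the PR):
```C++
#include <complex>
#include <c10/util/complex_type.h>  // assumed header for c10::complex

int main() {
  c10::complex<float> a(1.0f, 2.0f);   // standard constructor
  c10::complex<float> b = a * a + a;   // arithmetic operators
  float re = b.real();                 // standard accessors
  float im = b.imag();
  auto s = static_cast<std::complex<float>>(b);  // explicit cast back to std
  return (re == s.real() && im == s.imag()) ? 0 : 1;
}
```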
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35524

Differential Revision: D21021677

Pulled By: anjali411

fbshipit-source-id: 9e144e581fa4b2bee62d33adaf756ce5aadc0c71
2020-04-14 11:00:24 -07:00
9a680056ad Remove extern C for TH_API (#36142)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36142

Differential Revision: D20975451

Pulled By: anjali411

fbshipit-source-id: 1a42487f3af9be306cb08ddd8afa9b5e60545846
2020-04-14 10:55:52 -07:00
8a60d8bfe2 Create a new bionic image with clang9 (#36187)
Summary:
New images are already available at http://docker.pytorch.org/pytorch.html.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36187

Differential Revision: D21011545

Pulled By: ailzhang

fbshipit-source-id: 4fe98fa63110cb2ecb0194d4a8878fe9d2193611
2020-04-14 10:26:40 -07:00
4ebb1278e0 [quant] Update qbatch_norm name to qbatch_norm2d (#36494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36494

Make name consistent with op. Since we have batch_norm2d and batch_norm3d ops

Test Plan:
python test/quantization/test_quantized.py test_batch_norm2d

Imported from OSS

Differential Revision: D21008831

fbshipit-source-id: f81ca71a331d5620fd6a3f6175020a30f2e2566b
2020-04-14 10:04:27 -07:00
f3f640d479 move test_abs to device-generic tests (#36465)
Summary:
Per title. test_abs used to be marked as a slow_test and run on CPU only. Conceptually similar tests are done in TestTorchMathOps, so it's a matter of adding an `abs` test there. Two remaining checks (correct abs for large-valued long tensors, and correct abs for signed zeros) are factored into separate tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36465

Differential Revision: D21000248

Pulled By: ngimel

fbshipit-source-id: 8bc8b0da936b1c10fe016ff2f0dbb5ea428e7e61
2020-04-14 09:48:08 -07:00
4b3e3d8227 [improve logging] add the param information when logging the optimizer engine (#36558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36558

In the log, we frequently see a large chunk of "Using engine xx for rowWise Adagrad" messages, but without information on which parameter the engine is applied to.

Test Plan: Should be covered by existing testing that use optimizer

Reviewed By: chocjy

Differential Revision: D20985176

fbshipit-source-id: 6eb4e19e5307db53fc89b38594a3f303f1492a1c
2020-04-14 07:42:24 -07:00
d3cf9452af doc note on deterministic/non-deterministic gradient for min/max/median (#36481)
Summary:
An update on the note that the subgradients for min/max are not deterministic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36481

Differential Revision: D20993887

Pulled By: albanD

fbshipit-source-id: 4e1a7519d94a9dcf9d359ad679360874d32c1fe2
2020-04-14 07:27:18 -07:00
69e3ee2d5f DataLoader: properly diagnose exceeding file descriptor limit (#34768)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/973

Common failure scenario:
* DataLoader creates workers and communicates with them through SHMs
* Workers send back through an AF_UNIX socket file descriptors to SHMs containing data
* The limit of open files gets fully used
* A FD gets stripped from a socket message coming back from a worker, without the worker knowing this.
* This causes a `RuntimeError: received 0 items of ancdata` in the standard `multiprocessing` package
* The exception is not handled by PyTorch and so is presented to the users.

After this change the user will see

```
Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/wbaranowski/git/Quansight/pytorch/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 184, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 162, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 551, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 262, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpnx_f6v_f'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_shm_leak.py", line 56, in <module>
    worker_init_fn=worker_init_fn
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 861, in _next_data
    idx, data = self._get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 828, in _get_data
    success, data = self._try_get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 791, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34768

Differential Revision: D20538053

Pulled By: ezyang

fbshipit-source-id: be4425cf2fa02aff61619b2b829c153cb1a867cb
2020-04-14 07:10:57 -07:00
ed2d1cb2c4 Revert D20147487: Refactor jit::Operator to more clearly distinguish the two possible states
Test Plan: revert-hammer

Differential Revision:
D20147487

Original commit changeset: 50ce10b56f2b

fbshipit-source-id: e69432dc3d03002516cd248c84cfea08531c81be
2020-04-14 06:31:31 -07:00
fb70b4fb93 [caffe2] Add support for std::shared_ptr<std::vector<TensorList>> in PackRecordsOp and UnPackRecordsOp (#36550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36550

Separate dataset_ops changes into a separate diff.

Test Plan:
```
buck test caffe2/caffe2/python/operator_test:dataset_ops_test
```

AI/AF canary (tested with D20959214):
https://our.intern.facebook.com/intern/experiment_store/experiment/3298538636995/#commit1-commit2
https://our.intern.facebook.com/intern/experiment_store/experiment/2199027015376/#commit1-commit2

Reviewed By: yinghai

Differential Revision: D20988910

fbshipit-source-id: b37a7bfd131813e9472a5e2fa24d681d1ef19018
2020-04-14 03:43:21 -07:00
018c3420b8 Make dim, numel, element_size into prim ops (#36551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36551

Before, those ops were special cased in the jit codegen but that blocks our unboxing refactoring.
Instead, make those regular prim ops.
ghstack-source-id: 102081858

Test Plan: waitforsandcastle

Differential Revision: D21009196

fbshipit-source-id: b90320fce589fc0553f17582b66a5a05d0fd32d1
2020-04-14 02:18:36 -07:00
dd64e738c5 Expunge TensorId from all DispatchKey names. (#36240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36240

It's annoying, historical, and unnecessary (enum class is already
namespaced).  I did this codemod with:

```
git grep -l 'CPUTensorId' | xargs sed -i 's/CPUTensorId/CPU/g'
git grep -l 'CUDATensorId' | xargs sed -i 's/CUDATensorId/CUDA/g'
git grep -l 'VariableTensorId' | xargs sed -i 's/VariableTensorId/Autograd/g'
git grep -l 'HIPTensorId' | xargs sed -i 's/HIPTensorId/HIP/g'
git grep -l 'MSNPUTensorId' | xargs sed -i 's/MSNPUTensorId/MSNPU/g'
git grep -l 'XLATensorId' | xargs sed -i 's/XLATensorId/XLA/g'
git grep -l 'PrivateUse1_TensorId' | xargs sed -i 's/PrivateUse1_TensorId/PrivateUse1/g'
git grep -l 'PrivateUse2_TensorId' | xargs sed -i 's/PrivateUse2_TensorId/PrivateUse2/g'
git grep -l 'PrivateUse3_TensorId' | xargs sed -i 's/PrivateUse3_TensorId/PrivateUse3/g'
git grep -l 'AutocastTensorId' | xargs sed -i 's/AutocastTensorId/Autocast/g'
git grep -l '_PreAutogradTensorId' | xargs sed -i 's/_PreAutogradTensorId/_PreAutograd/g'
git grep -l 'TESTING_ONLY_GenericWrapperTensorId' | xargs sed -i 's/TESTING_ONLY_GenericWrapperTensorId/TESTING_ONLY_GenericWrapper/g'
git grep -l 'TESTING_ONLY_GenericModeTensorId' | xargs sed -i 's/TESTING_ONLY_GenericModeTensorId/TESTING_ONLY_GenericMode/g'
```

Then I did a git grep for remaining TensorId occurrences, and manually
killed those (mostly in codegen, and some docs that needed updating).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929255

Pulled By: ezyang

fbshipit-source-id: dc371b6aa6e6ea7c0a5660137c14debde806a09d
2020-04-13 23:33:44 -07:00
8f501f3083 Update internal invariants in the world of manuallyBoxedKernel (#36388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36388

(This makes me want to barf.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20964195

Pulled By: ezyang

fbshipit-source-id: 3699a02b16060d79dae9890bafeaafad9ad9ae60
2020-04-13 23:32:01 -07:00
076d46f826 [ROCm] Add debug flag (#36521)
Summary:
This kernel debug flag should help locate the issues we are observing on
some of the CI nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36521

Differential Revision: D21010612

Pulled By: ezyang

fbshipit-source-id: d746e4eb0af832e770d2231bfee4154b6e703c19
2020-04-13 23:27:24 -07:00
6e7eaabf49 Lock optimizations for DistAutogradContainer. (#36529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36529

DistAutogradContainer is a singleton for the entire process and has a
single lock that protects access to map keyed by context_id. Performance
profiling showed that this lock is a potential bottleneck for training. As a
result, in this PR, we have the following optimizations:

1) Shard the map into 256 buckets with each bucket having its own lock. This
would ensure we hold much finer grained locks.
2) sendReleaseContextRpc was being called under a lock; moved this call
outside the lock.
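Roughly, the sharding looks like this (type names and the helper are illustrative):

```C++
#include <array>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct DistAutogradContext;  // opaque in this sketch

// One lock per shard instead of one container-wide lock, so lookups for
// unrelated context_ids rarely contend.
struct ContextsShard {
  std::mutex lock;
  std::unordered_map<int64_t, std::shared_ptr<DistAutogradContext>> contexts;
};

class ShardedContexts {
 public:
  ContextsShard& shardFor(int64_t context_id) {
    return shards_[static_cast<uint64_t>(context_id) % shards_.size()];
  }

 private:
  std::array<ContextsShard, 256> shards_;  // 256 buckets, as in the PR
};
```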
ghstack-source-id: 102085139

Test Plan: waitforbuildbot

Differential Revision: D21003934

fbshipit-source-id: 55f80dd317311bce0efd3ca8ca617d071297b5dc
2020-04-13 23:11:22 -07:00
411ccce279 Revert D20936595: Make DispatchKeyExtractor forget about TensorOptions
Test Plan: revert-hammer

Differential Revision:
D20936595

Original commit changeset: c2f3cc567761

fbshipit-source-id: 1fcaa0484377e1580c08cd89fd0fcbdeb3f73f11
2020-04-13 23:05:21 -07:00
999d7f6ab2 [jit] tracer flag to guard risky behaivors (#36277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36277

This PR introduces a flag to the tracer that guards risky behaviors
like adding a list/dict as the output of the tracer. Currently, to avoid
breaking BC for users, we throw a warning if the tracer output is a list, and
will throw an error when the tracer output is a dict to enforce using this
flag (next PR)

Test Plan: Imported from OSS

Differential Revision: D20998157

Pulled By: wanchaol

fbshipit-source-id: 0d2c55f1a263a48b1b92dd6ad54407815e0a6f72
2020-04-13 22:35:03 -07:00
d5ba39c25d [TensorExpr] Postpone insertion of Alloc/Free statements in computeAt. (#36526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36526

Test Plan: Imported from OSS

Differential Revision: D21004740

Pulled By: ZolotukhinM

fbshipit-source-id: 8ac8db0d4e31065e4fbd3e0cc27f15a15dcb141c
2020-04-13 22:30:00 -07:00
4d1ccafb4b [caffe2] Enable copying for caffe2::Tensor (#36468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36468

Since `caffe2::Tensor` is now refcounted, enabling copy constructor and the copy assignment operator should be fine.

Test Plan:
```
buck test mode/dev //caffe2/caffe2:caffe2_test_cpu -- TensorTest
```

AI/AF canaries with changes up to D20959214:

https://our.intern.facebook.com/intern/experiment_store/experiment/3298538636995/#commit1-commit2
https://our.intern.facebook.com/intern/experiment_store/experiment/2199027015376/#commit1-commit2

AI/AF canaries on this diff:
https://our.intern.facebook.com/intern/ads/canary/425960191574068914/
https://our.intern.facebook.com/intern/ads/canary/425960179835413033/

Reviewed By: yinghai

Differential Revision: D20985924

fbshipit-source-id: ead5f5ceff23d0adc06d598128de16a5533d767b
2020-04-13 21:41:52 -07:00
Jie
289d52c120 Fixing SyncBN dgrad (#36382)
Summary:
The previous PR https://github.com/pytorch/pytorch/issues/22248, which provides support for variable batch sizes across processes, doesn't account for mean_dy/mean_dy_xmu on the backward path, which produces a wrong dgrad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36382

Differential Revision: D20984446

Pulled By: ngimel

fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
2020-04-13 21:08:31 -07:00
70b826a884 Make DispatchKeyExtractor forget about TensorOptions (#36290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36290

The BackendSelect dispatch key gives us a way to extract backend-
specific dispatch keys from non-Tensor arguments without teaching
the DispatchKeyExtractor about them. Here we finish switching over
to the BackendSelect approach for factory functions and remove
TensorOptions from the set of types DispatchKeyExtractor needs to
consider.

Test Plan: Imported from OSS

Differential Revision: D20936595

Pulled By: bhosmer

fbshipit-source-id: c2f3cc56776197a792cae2a83aeaca995effaad2
2020-04-13 20:57:50 -07:00
36b273abc0 Refactor jit::Operator to more clearly distinguish the two possible states (#33905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33905

jit::Operator is semantically either a c10 op or a jit-only op, but that was represented by a set of member variables with intricate invariants about their values.
Making this explicitly represented in a c10::either reduces the number of possible states, removing many of the invalid ones.

Similarly, if it is a jit-only op, there were schema_string_ and schema_, of which only one could be set at any time. A c10::either is used there too.
ghstack-source-id: 102084054

Test Plan: unit tests

Differential Revision: D20147487

fbshipit-source-id: 50ce10b56f2b1f51c8279cef03077c861db3eaac
2020-04-13 20:54:42 -07:00
9fcb4ab393 Fix either::map naming (#33904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33904

This was misnamed and should actually be either::fold.
ghstack-source-id: 102050883

Test Plan: it's just a rename

Differential Revision: D20148263

fbshipit-source-id: 5d2ed92230e20e8bb7dec26ac3f26de7f03a6e39
2020-04-13 20:53:28 -07:00
eb00bac2b5 Make FakeLowP tests work (#36525)
Summary:
Make the e2e FakeLowP Python tests work with Glow lowering in an OSS environment. Added a README.md as a guideline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36525

Reviewed By: hyuen

Differential Revision: D21004706

Pulled By: yinghai

fbshipit-source-id: d182152e4a1a3368640bd7872cb9ea4d4bff4b02
2020-04-13 20:16:33 -07:00
8544591f5a Fix a segfault in DeviceThreadHandlePool and PoolWindow (#36416)
Summary:
This PR fixes a bug related to object destruction order across threads. The bug can cause segfaults during shutdown of processes that use libtorch.

See https://github.com/pytorch/pytorch/issues/36408 for more detail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36416

Differential Revision: D21006321

Pulled By: ezyang

fbshipit-source-id: da97936d9f2ed3f3e3aba8a3a29b38314f04b57f
2020-04-13 20:10:46 -07:00
c7631716da Output more debugging information for reduce kernel (#35946)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35946

Differential Revision: D21007660

Pulled By: ngimel

fbshipit-source-id: 83dc257c3d9ff722d30270214c413d8a16bcffc0
2020-04-13 19:43:38 -07:00
1e22717118 qnnpack hardswish - pytorch op integration (#36320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36320

Hooks up the aten quantized hardswish op to the QNNPACK
path added in the previous PR.

Test Plan:
tests pass

will run benchmarking on mobile to confirm

Imported from OSS

Differential Revision: D20965043

fbshipit-source-id: e3f147268142103b5ea3f48610aa3b9837b7b61a
2020-04-13 19:08:07 -07:00
0964b662c3 qnnpack hardswish - LUTs (#36252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36252

Adds a baseline hardswish kernel using LUTs in QNNPACK.
Performance is 1.9 GB/s on a Nexus 6 and 2.2 GB/s on Pixel 3 - same as other LUT based ops.

Enforcing the scale and zero point to be equal to the input's, to match the server implementation.

There are some potential improvements in rewriting this as NEON
kernels for a further speedup - saving that until later, if we need it.
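For illustration only, the LUT technique has roughly this shape (not the actual QNNPACK micro-kernel; per the note above, the output reuses the input's scale and zero point):

```C++
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Precompute hardswish for all 256 possible quantized inputs, then apply
// it element-wise as a table lookup.
void hardswish_u8_lut(const uint8_t* x, uint8_t* y, size_t n,
                      float scale, int32_t zero_point) {
  uint8_t lut[256];
  for (int i = 0; i < 256; ++i) {
    float v = scale * static_cast<float>(i - zero_point);           // dequantize
    float h = v * std::min(std::max(v + 3.0f, 0.0f), 6.0f) / 6.0f;  // hardswish
    int32_t q = zero_point +
        static_cast<int32_t>(std::lround(h / scale));               // requantize
    lut[i] = static_cast<uint8_t>(std::min(std::max(q, 0), 255));
  }
  for (size_t i = 0; i < n; ++i) {
    y[i] = lut[x[i]];
  }
}
```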

Test Plan:
```
with-proxy ./scripts/build-local.sh
./build/local/hardswish-test

with-proxy scripts/build-android-armv7.sh
adb push ./build/android/armeabi-v7a/hardswish-* /data/qnnpack
adb shell
/data/qnnpack/hardswish-test
/data/qnnpack/hardswish-bench

with-proxy scripts/build-android-arm64.sh
adb push ./build/android/arm64-v8a/hardswish-* /data/qnnpack
/data/qnnpack/hardswish-test
/data/qnnpack/hardswish-bench
```

Imported from OSS

Differential Revision: D20965044

fbshipit-source-id: 982938361971513cb15873438e12c23a38e819e3
2020-04-13 19:06:59 -07:00
455d4aab64 [PyTorch Numeric Suite] Add weight compare API (#36186)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36186

Start PyTorch Numeric Suite under PyTorch quantization and add weight compare API to it.
ghstack-source-id: 102062165

Test Plan: buck test mode/dev caffe2/test:quantization -- 'test_compare_weights'

Differential Revision: D20903395

fbshipit-source-id: 125d84569837142626a0e2119b3b7657a32dbf4e
2020-04-13 19:02:00 -07:00
739351fac4 Fix linter warning: replace f-strings with str.format for Py2 compat (#35492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35492

Test Plan: Imported from OSS

Differential Revision: D20998727

Pulled By: drdarshan

fbshipit-source-id: 54f34a7649a2772ad030b456f1b50aba831ce2e0
2020-04-13 18:43:58 -07:00
501d9f33ab Fix clang format (#36544)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36544

Differential Revision: D21007972

Pulled By: malfet

fbshipit-source-id: 5c252ac628553e00b6d55d29233272e14a3f2545
2020-04-13 18:36:48 -07:00
0b7e832325 Fix signed integer overflow in rng_test.h (#36421)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36421

Test Plan: Imported from OSS

Differential Revision: D20978925

Pulled By: pbelevich

fbshipit-source-id: 30b6abb19abe70738a3a68427f3b3df67510fb48
2020-04-13 18:30:21 -07:00
fd008bd170 Make patterns in test_unmatched_annotations more flexible (#36422)
Summary:
To make them compatible with python3.7 and python3.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36422

Test Plan: CI

Differential Revision: D21006399

Pulled By: malfet

fbshipit-source-id: 725df277ff3e4479fc2c39d16a30fbf301fde9e5
2020-04-13 17:53:37 -07:00
1f40bddf57 [TensorBoard] fix #36471 (#36495)
Summary:
cc orionr sanekmelnikov

Confirm that the function was removed already.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36495

Differential Revision: D21003122

Pulled By: natalialunova

fbshipit-source-id: 364b0790953980e02eb7ff8fa0b6218d7e34a0c3
2020-04-13 17:49:16 -07:00
c49de6ce0d [TensorBoard] fix #33140 (#36497)
Summary:
cc orionr sanekmelnikov

The fix was ported from 9d267066a6
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36497

Differential Revision: D21002710

Pulled By: natalialunova

fbshipit-source-id: 0d2f3697c650bccdf6de52583de0d38c5c219261
2020-04-13 17:44:09 -07:00
0912284830 CI failure tips (#36507)
Summary:
Finding out how to ssh into a CircleCI job to debug a failure is a challenge because, as far as I know, there isn't any concise documentation about it. I figured it might be nice to include this in CONTRIBUTING.md.

Maybe there are some other tips about non-CircleCI jobs that could be added in the future as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36507

Differential Revision: D21006526

Pulled By: ezyang

fbshipit-source-id: 0a544ecf37bf9550e9b2f07595332dc5f394bb9e
2020-04-13 17:39:47 -07:00
b38d505e42 [shape inference] use max_seq_size as max_feature_len in SLS and LengthsRangeFill inference (#36346)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36346

Reviewed By: yinghai, ipiszy

Differential Revision: D20952490

fbshipit-source-id: ac0e00b47be3fbfa908b84e062c83817dc326924
2020-04-13 17:28:43 -07:00
110893abf0 [Shape Inference] Infer input(1) from input(0) in elementwise ops (#36498)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36498

Test Plan: unit test

Reviewed By: yinghai

Differential Revision: D20996757

fbshipit-source-id: 35e5e0bed63d2ba8a0699a8a78bff6f78e7af23c
2020-04-13 16:58:34 -07:00
c9a1fc2b31 replace Generator arguments with c10::optional<Generator> (#36232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36232

The purpose of this PR is to replace `at::Generator generator = nullptr` with `c10::optional<at::Generator> = c10::nullopt` all over the code

* #36230 Replace std::shared_ptr with c10::intrusive_ptr in at::Generator

Test Plan: Imported from OSS

Differential Revision: D20943603

Pulled By: pbelevich

fbshipit-source-id: 65d335990f01fcc706867d5344e73793fad68ae6
2020-04-13 16:26:57 -07:00
5a7f889a11 Use bazel build rules from fbgemm (#36339)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36339

Test Plan: CI

Differential Revision: D21004672

Pulled By: malfet

fbshipit-source-id: 8ce5b436686cfb70141104aa6722c8cc13609caa
2020-04-13 16:03:27 -07:00
3526627f46 Use unittest assertWarns instead (#36411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36411

This PR remove pytorch specific defined assertwarns and use the unit
test one, also format some tests

Test Plan: Imported from OSS

Differential Revision: D20998159

Pulled By: wanchaol

fbshipit-source-id: 1280ecff2dd293b95a639d13cc7417fc819c2201
2020-04-13 15:56:42 -07:00
d7b7998370 Enable more tests in fbcode (#36418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36418

Those tests were only run in oss before but should also run in fbcode.
ghstack-source-id: 101973722

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D20976954

fbshipit-source-id: 7ced56dcbdbfe0e07993871a7811a086894b6b32
2020-04-13 15:51:53 -07:00
0035aeef40 [autograd] Avoid holding lock when completing GraphTask futureResult (#35101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35101

TSAN is noting a lock-order inversion in the context of dist autograd because
we're holding a lock when GraphTask calls markCompleted() on the relevant futureResult_.

Add an atomic bool to make it possible to protect this without holding the mutex,
and also fix alignment of a few struct vars.

ghstack-source-id: 101805283

Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/rpc:dist_autograd_spawn_thrift

Differential Revision: D20553517

fbshipit-source-id: 446e3718dd68876bd312166ecceed1d92868ce4e
2020-04-13 15:23:47 -07:00
765bf8f03d Remove duplicate bindings from torch/csrc/jit/python/init.cpp. (#36492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36492

Test Plan: Imported from OSS

Differential Revision: D20995235

Pulled By: ZolotukhinM

fbshipit-source-id: 6afa3a956e57c2fb94bb29d332177be73a2bac2a
2020-04-13 12:28:32 -07:00
ced9edbaa4 [Torch Device][c10] Fix the expected torch device error message (#36446)
Summary:
This PR makes the expected torch device string error message include `xla` as an acceptable torch device prefix string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36446

Test Plan:
No Logic changed, and made sure `xla` is acceptable in `torch.device`.
```
import torch

device = torch.device("xla")
```

```
device = torch.device("unrecognized")

RuntimeError: Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu, xla device type at start of device string: unrecognized
```

Differential Revision: D20993449

Pulled By: dahsh

fbshipit-source-id: 83afe4f913a650a655bfda9c2a64bf9e5aa27e16
2020-04-13 12:02:07 -07:00
d070c0bcf0 ROCm: enable cpp_extensions.load/load_inline (#35897)
Summary:
This enables cpp_extensions.load/load_inline. It works by hipify-ing CUDA sources.
Also enables tests.
CuDNN/MIOpen extensions aren't yet supported; I propose not to handle those in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35897

Differential Revision: D20983279

Pulled By: ezyang

fbshipit-source-id: a5d0f5ac592d04488a6a46522c58e2ee0a6fd57c
2020-04-13 11:44:08 -07:00
ce54f0d411 Back out "Revert D20449887: [dt][caffe2] enable using smart exceptions in async nets" (#36172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36172

Original commit changeset: 3d7801613f86

D20449887 broke some OSS tests as the OSS export sync wasn't working correctly.

Test Plan:
Manually export latest version to OSS to trigger the tests

+ test plan in D20449887

verified onnx tests are passing in https://github.com/pytorch/pytorch/pull/36172

Reviewed By: andrewwdye

Differential Revision: D20902279

fbshipit-source-id: bc30fcc9f5cc8076f69a5d92675fd27455948372
2020-04-13 11:31:52 -07:00
d591a7bb82 Use Function to implement fork. (#36179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36179

This ensures normal optimization passes run for forked functions.

Test Plan: Imported from OSS

Differential Revision: D20907253

Pulled By: zdevito

fbshipit-source-id: 72cfa9f82643214b1ef3de24697d163a9a24b29c
2020-04-13 11:21:48 -07:00
967cdc2baf Simplify replicate logic (#36174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36174

Test Plan: Imported from OSS

Differential Revision: D20903301

Pulled By: zdevito

fbshipit-source-id: 714a32fe417b7d1615886936c41505d1ba538f47
2020-04-13 11:21:43 -07:00
4f956fcf88 _requires_grad -> requires_grad (#36168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36168

This makes the keyword name match what eager mode uses. Currently
`_requires_grad` will not appear in serialization because it is not listed as
kwarg-only. There is a small chance that some model which has never been run
in eager mode uses the `_requires_grad` name, but this is rare enough that I
don't think we need to worry about it unless something breaks in testing.
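
A hedged sketch of what the rename permits in scripted code (assuming the schema change above):

```
import torch

@torch.jit.script
def make_leaf() -> torch.Tensor:
    # The eager-mode keyword name now also works in TorchScript;
    # previously the schema spelled it `_requires_grad`.
    return torch.tensor([1.0, 2.0], requires_grad=True)
```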

Test Plan: Imported from OSS

Differential Revision: D20902557

Pulled By: zdevito

fbshipit-source-id: 605cf5371b4fc15ec1b4e8a12f9660d723530de4
2020-04-13 11:20:27 -07:00
e3b6dd1708 [rref] Minor tweaks in rref_context (#36419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36419

Since we call waitForThreadLocalPendingRRefs per-RPC, construct it
already-satisfied in the common empty case, to avoid extra mutex/cv work.

Also, naming consistency for recording_.
ghstack-source-id: 101975739

Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20977879

fbshipit-source-id: e321a33127e4b5797e44e039839c579057e778e5
2020-04-13 10:19:02 -07:00
2bc49a4b85 block_diag dense (#33449)
Summary:
Add block_diag function for dense tensors, based on scipy.linalg.block_diag

Closes https://github.com/pytorch/pytorch/issues/31932
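
A short usage sketch, mirroring scipy.linalg.block_diag:

```
import torch

A = torch.tensor([[1, 2], [3, 4]])
B = torch.tensor([[5]])
# Stacks the inputs along the diagonal, zero-filling elsewhere:
# [[1, 2, 0],
#  [3, 4, 0],
#  [0, 0, 5]]
C = torch.block_diag(A, B)
```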
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33449

Differential Revision: D20943099

Pulled By: zou3519

fbshipit-source-id: 8b5c9476fb5af959aafa4169612c660396d9b717
2020-04-13 10:04:55 -07:00
35cc2bbca3 Removed unnecessary call to '_strong_wolfe' in LBFGS. (#36453)
Summary:
It was called twice, but the result of the first invocation was not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36453

Differential Revision: D20993535

Pulled By: yf225

fbshipit-source-id: 4d85207a936b846866424903d7622905f3fddd36
2020-04-13 09:06:33 -07:00
1e15063761 ThroughputBenchmark: integration with Autograd Profiler (#36282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36282

The reason to do this explicitly in the tool is that we don't want to capture warmup (or input cloning) in the profile. So instead we make the benchmarking code explicitly aware of the profiler.

Example output:

```
I0408 16:06:40.300040 85516 throughput_benchmark-inl.h:106] Using Autograd profiler. Trace will be saved to /tmp/tmpt0gsz85y
I0408 16:06:40.302232 85516 throughput_benchmark-inl.h:111] Starting threads
I0408 16:06:40.302258 85524 throughput_benchmark-inl.h:78] Starting forward thread 1
I0408 16:06:40.302259 85525 throughput_benchmark-inl.h:78] Starting forward thread 2
I0408 16:06:40.302261 85523 throughput_benchmark-inl.h:78] Starting forward thread 0
I0408 16:06:40.302259 85526 throughput_benchmark-inl.h:78] Starting forward thread 3
I0408 16:06:40.412879 85525 throughput_benchmark-inl.h:88] Shutting down forward thread 2. Total number of finished threads: 1
I0408 16:06:40.412971 85523 throughput_benchmark-inl.h:88] Shutting down forward thread 0. Total number of finished threads: 2
I0408 16:06:40.412989 85526 throughput_benchmark-inl.h:88] Shutting down forward thread 3. Total number of finished threads: 3
I0408 16:06:40.413033 85524 throughput_benchmark-inl.h:88] Shutting down forward thread 1. Total number of finished threads: 4
I0408 16:06:40.413056 85516 throughput_benchmark-inl.h:123] Finished benchmark
Average latency per example: 443.256us
Total number of iterations: 1000
Total number of iterations per second (across all threads): 9024.12
Total time: 110.814ms
```
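
A hedged Python-side usage sketch (the `profiler_output_path` keyword name is an assumption based on the description above):

```
import torch
from torch.utils import ThroughputBenchmark

module = torch.jit.script(torch.nn.Linear(16, 16))
bench = ThroughputBenchmark(module)
bench.add_input(torch.randn(1, 16))
# Warmup and input cloning happen outside the profiled region.
stats = bench.benchmark(num_calling_threads=4,
                        num_warmup_iters=10,
                        num_iters=1000,
                        profiler_output_path="/tmp/trace")  # assumed kwarg
print(stats)
```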

Test Plan: Imported from OSS

Differential Revision: D20987125

Pulled By: ezyang

fbshipit-source-id: 1f8980c3a5a0abdc268c7a16c99aa9ea868689eb
2020-04-13 08:53:40 -07:00
a2e059cfa6 add missing 'import warnings' (#35313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35313

The intention of D16955662 was to print a warning when a single-layer LSTM has an (ignored) dropout specified. I ran into this warning with one of our models, but instead of a warning I got "name 'warnings' is not defined". The linter could have called out that problem on the original diff, not sure why it didn't.
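
An illustrative reproduction of the bug class (function and message are hypothetical):

```
import warnings  # the missing import this diff adds

def check_dropout(num_layers: int, dropout: float) -> None:
    # Without `import warnings` at module scope, this branch raised
    # "NameError: name 'warnings' is not defined" instead of warning.
    if num_layers == 1 and dropout > 0:
        warnings.warn("dropout is ignored for a single-layer LSTM")
```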

Test Plan: Before this diff JITing a particular model in f176977725 yielded "name 'warnings' is not defined". After this diff f176980937 gets past that point (failing in an unrelated downstream workflow).

Reviewed By: jianyuh

Differential Revision: D20611822

fbshipit-source-id: 99d90f4830f3b15ddbf1e2146e2cc014ef26c2ab
2020-04-13 08:41:44 -07:00
379e4d9cad [pytorch] Make behavior of SobolEngine consistent w/ other RNG functions (#36427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36427

Addresses https://github.com/pytorch/pytorch/issues/36341

Test Plan: unit tests

Reviewed By: ldworkin

Differential Revision: D20952703

fbshipit-source-id: 28055f4c4c0f8012c2d96e473b822fa455dd833c
2020-04-13 07:53:33 -07:00
d2e0c628e9 Updating submodules
Summary:
GitHub commits:

15e343ce0c

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 0955bbb3c6628319981f52e7b3076af1ae28ddfe
2020-04-13 00:00:48 -07:00
b92f8d9b7e Revert D20950587: [pytorch][PR] Added complex types to get_all_dtypes and turned on masked_fill for complex
Test Plan: revert-hammer

Differential Revision:
D20950587

Original commit changeset: ba7c372a28f0

fbshipit-source-id: 487ac59a971b1ecefd20fd446385ba12334d9695
2020-04-12 21:33:17 -07:00
6be8560375 Do not double compile generated files (#36417)
Summary:
Bazel puts generated files in its private hermetic builds, but for some reason also searches for files in `torch/csrcs/*/generated/` folders.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36417

Test Plan: Use the same folder to compile pytorch using cmake and bazel

Differential Revision: D20987580

Pulled By: malfet

fbshipit-source-id: 36d15ba3ce0d0c7ea923ddef902bd500f2578430
2020-04-12 15:16:38 -07:00
4bcd8ab6f7 Added complex types to get_all_dtypes and turned on masked_fill for complex (#36335)
Summary:
1. Added complex dtypes to get_all_dtypes to unify their testing with the other dtypes, so that complex support doesn't drift out of sync with the behavior supported for other dtypes.
2. resolves https://github.com/pytorch/pytorch/issues/36322, https://github.com/pytorch/pytorch/issues/36327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36335

Differential Revision: D20950587

Pulled By: anjali411

fbshipit-source-id: ba7c372a28f007372b6f15adf7c52d3a09fd4007
2020-04-12 13:41:06 -07:00
d83509e603 [quant] Fix for the conv1d kernel shape (#36397)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36397

Differential Revision: D20966295

Test Plan: Imported from OSS

Pulled By: z-a-f

fbshipit-source-id: bd2ab9dcfe22b900cff1ddffa60618fa8f703a1f
2020-04-11 22:34:46 -07:00
0c9bf64989 Disables complex clamp (#36373)
Summary:
This partially addresses https://github.com/pytorch/pytorch/issues/33568 by disabling clamp for complex inputs until an appropriate solution can be implemented. test_complex_unsupported in test_torch.py is extended to validate this behavior.
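
A quick sketch of the resulting behavior (the exact error text is not guaranteed):

```
import torch

z = torch.tensor([1 + 1j])
try:
    torch.clamp(z, min=0)  # complex numbers are unordered, so this raises
except RuntimeError as e:
    print("clamp rejected complex input:", e)
```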
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36373

Differential Revision: D20984435

Pulled By: mruberry

fbshipit-source-id: 49fd2e1e3a309f6a948585023953bae7ce3734c8
2020-04-11 22:24:06 -07:00
254be6a201 Adds NumPy array x Torch tensor binary ufunc interaction test (#35945)
Summary:
Adds test for behavior reported in https://github.com/pytorch/pytorch/issues/35257 to ensure it doesn't regress. The test was extended to reveal three additional issues:

- https://github.com/pytorch/pytorch/issues/36363
- https://github.com/pytorch/pytorch/issues/36058
- https://github.com/pytorch/pytorch/issues/36057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35945

Differential Revision: D20984429

Pulled By: mruberry

fbshipit-source-id: a15be9455afba9c77e40c337a860f9be348bf8d5
2020-04-11 21:56:38 -07:00
4f728c9d81 [ONNX] Enable constant folding for Shape (#35386)
Summary:
Enabled constant folding for onnx::Shape
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35386

Reviewed By: hl475

Differential Revision: D20682412

Pulled By: houseroad

fbshipit-source-id: 4559a35f174edfb7e6364c0fbf5bc1d55d0d26dc
2020-04-11 13:49:52 -07:00
e3af0c9f9b [TensorExpr] Add new file bounds_inference.cpp to BUILD.bazel. (#36440)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36440

Test Plan: Imported from OSS

Differential Revision: D20983368

Pulled By: ZolotukhinM

fbshipit-source-id: 5d847c5b066297e8e5585c870387165f89938e45
2020-04-11 13:19:18 -07:00
c1efe1ddb5 Enable building of FakeLowP ops (#36170)
Summary:
We open sourced the FakeLowp ops as a reference implementation of fp16 ops. This PR makes them buildable.

```
USE_CUDA=0 USE_ROCM=0 USE_FAKELOWP=ON python setup.py install
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36170

Test Plan:
Build Onnxifi library in Glow.
```
cp ${GLOW}/build/lib/Onnxifi/libonnxifi-glow.so ${MY_PATH}/ibonnxifi.so
LD_LIBRARY_PATH=${MY_PATH}/ibonnxifi.so python pytorch/caffe2/python/fakelowp/test_sls_nnpi_fp16.py
```

It doesn't run successfully right now because we need to open source the glow gflags and some other ops like `FbgemmPack`.

Reviewed By: houseroad

Differential Revision: D20980681

Pulled By: yinghai

fbshipit-source-id: 6dd31883a985850a77261bcc527029479bbc303f
2020-04-11 13:17:59 -07:00
7aa6a8fd7a Disables complex min and max (#36377)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/36374 by disabling min and max for complex inputs. test_complex_unsupported in test_torch.py is extended to validate this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36377

Differential Revision: D20964661

Pulled By: mruberry

fbshipit-source-id: 79606c2e88c17c702543f4af75847d2460586c2d
2020-04-11 12:30:35 -07:00
7b9ab91614 Improve boxed dispatch performance (#33313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33313

Instead of just remembering the number of arguments and iterating over the stack,
the DispatchKeyExtractor now remembers the exact locations of the dispatch-relevant
arguments (i.e. Tensor arguments) and only looks at those.
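
A toy Python model of the optimization (the real implementation is C++; `dispatch_keys_of` is a stand-in helper):

```
def dispatch_keys_of(tensor):
    # Stand-in for reading a tensor's DispatchKeySet.
    return {"CUDA"} if tensor.is_cuda else {"CPU"}

class DispatchKeyExtractor:
    def __init__(self, schema_arg_types):
        # Precompute which stack slots hold Tensors, so dispatch
        # no longer scans every argument on every call.
        self.tensor_slots = [i for i, t in enumerate(schema_arg_types)
                             if t == "Tensor"]

    def extract(self, stack):
        keys = set()
        for i in self.tensor_slots:  # only the dispatch-relevant slots
            keys |= dispatch_keys_of(stack[i])
        return keys
```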
ghstack-source-id: 101908386

Test Plan: unit tests, benchmarks

Differential Revision: D19748549

fbshipit-source-id: b5b9ff2233b3507e0b600460f422912cfa9e3f0f
2020-04-11 12:04:27 -07:00
22212a82b4 Remove functor factories in KernelFunction (#35488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35488

-
The original reason these existed was a SIOF (static initialization order fiasco; see the multi-line comment deleted in this PR).
However, I think this SIOF situation only happened for caffe2 kernels exposed to PyTorch, and those now use a different mechanism that shouldn't cause the SIOF anymore (they now create the caffe2 kernel instance on each call instead of storing it in the functor). If this PR passes CI, I'm assuming the SIOF no longer exists and we can simplify this code.
ghstack-source-id: 101933838

Test Plan: waitforsandcastle

Differential Revision: D20676093

fbshipit-source-id: 462e11f75f45d9012095d87f447be88416f5dcdc
2020-04-11 11:58:40 -07:00
91441ae87f [Lite Interpreter] Move Implicit ops to register_prim_ops.cpp (#36406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36406

As title. Move the related operators so that they are available from lite interpreter.
ghstack-source-id: 101944177

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: ayush29feb

Differential Revision: D20958833

fbshipit-source-id: a755d4d662b9757d8d425b7a25f519aaad1fd330
2020-04-11 09:34:37 -07:00
df5f0a04ff [TensorExpr] Implement LoopNest::computeAt (#36112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36112

Differential Revision: D20885662

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 4ea6293b249562fca46739dc36c5483d912e5838
2020-04-11 04:01:14 -07:00
397aa46a3e [TensorExpr] Bounds inference (#35120)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35120

Differential Revision: D20567926

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 89a2afcddaf23a5c6259c15e4f7194e8649c1c4d
2020-04-11 03:59:34 -07:00
c856a2cb0d Move unboxing to after dispatch for ops with manual kernel registrations (#36398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36398

ghstack-source-id: 101935322

Test Plan: CI

Differential Revision: D20967226

fbshipit-source-id: 10e694bd7cd53e5efa1f21c5aa2c9ba3fac9ba44
2020-04-11 03:33:18 -07:00
7e8c27ed25 Fix view_complex_as_float for empty tensors (#36415)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36415

Test Plan: Imported from OSS

Differential Revision: D20974194

Pulled By: pbelevich

fbshipit-source-id: afc19a47d585b7c0c33fcde922d10fa377194315
2020-04-11 03:18:10 -07:00
742c77971a Revert D20961711: [pytorch][PR] Returns float tensors for complex inputs to abs
Test Plan: revert-hammer

Differential Revision:
D20961711

Original commit changeset: 232f62cf64ca

fbshipit-source-id: 7b2a537d2effe6b2449f192dc42e375062058995
2020-04-11 02:55:41 -07:00
ae452a81a9 [DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait() (#36395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36395

As titled.

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_fork -- test_restore_context_after_swtich_to_jit_thread

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/jit/dist_autograd_fork\#binary.par -r test_restore_context_after_swtich_to_jit_thread
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork -- test_backward_simple_script_call

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D7857991

fbshipit-source-id: 168e0e3846a50ea92d4f9450a30ccc6c13e2fcec
2020-04-11 02:51:39 -07:00
0dbb21f89e Revert D20931186: Enable c10 unboxing for ops with TensorList
Test Plan: revert-hammer

Differential Revision:
D20931186

Original commit changeset: 494723326070

fbshipit-source-id: c5e2386ad06acabaee05f6addcf1eb898f3d4ae0
2020-04-11 02:47:34 -07:00
409346eee3 Updating submodules
Summary:
GitHub commits:

1b6423cc1f

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 112f86c34c8f2f666271deb34e97d74f6f528696
2020-04-10 20:41:34 -07:00
5b331e8611 Catch exception in distributed engine callbacks. (#36118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36118

Callbacks registered with the autograd engine Future in the
distributed engine have a non-trivial amount of business logic. Its entirely
possible that we throw exceptions in these callbacks resulting in those not
being propagated back to the client (since the appropriate future was not
marked as completed).

In this PR, I've added appropriate try-catch blocks to ensure we always mark
the appropriate Future with an error.
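
The pattern, sketched in Python (names are illustrative; the real code is C++):

```
from concurrent.futures import Future

def complete_with_callbacks(future: Future, callbacks, result) -> None:
    try:
        for cb in callbacks:
            cb(result)  # business logic that may raise
        future.set_result(result)
    except Exception as exc:
        # Always record the error on the Future so the client is
        # unblocked with an exception instead of waiting forever.
        future.set_exception(exc)
```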
ghstack-source-id: 101904294

Test Plan: Tested by simulating an exception.

Differential Revision: D20885521

fbshipit-source-id: b6b6f5994a5fb439e40ec7c585435b6dfe7ddb8e
2020-04-10 19:41:30 -07:00
d71aeeceef Updating submodules
Summary:
GitHub commits:

16c94d8911
0bed508d4a
1b437911a9
79c102e8df
5c19a441c4
76f1dd09a9
9623728be6
2df3f7e68f
41e97f3303

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 02b3999a2ef53f78d521af12dff15d82c79f877e
2020-04-10 18:28:39 -07:00
e892398922 Upstream generic device test patch. (#36321)
Summary:
So that XLA can run all tests by setting env `PYTORCH_TEST_PATH` instead of patching a diff. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36321

Differential Revision: D20946635

Pulled By: ailzhang

fbshipit-source-id: 55ab7db7fd93063ad495a0c23a903218a29625a4
2020-04-10 16:59:48 -07:00
4305c7f97e Remove experimental c10 ops (#36394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36394

These are remnants from the time when c10 was being constructed. They've fulfilled their goal of making sure that the c10 library supports all needed corner cases, and those corner cases are now covered by actual ops. We don't need these experimental ops anymore.
ghstack-source-id: 101933837

Test Plan: CI

Differential Revision: D20965279

fbshipit-source-id: ff46f2482ff58ca3fa955288083b12ec2066938e
2020-04-10 16:11:16 -07:00
6920b13500 Move fakelowp tests from glow to caffe2 (#36409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36409

Pull Request resolved: https://github.com/pytorch/glow/pull/4409

Since Glow OSS doesn't really ship with Python, it's much easier to do this in PyTorch. All the Glow dependencies can be resolved through LD_LIBRARY_PATH in OSS.

Test Plan:
```
buck test caffe2/caffe2/python/fakelowp:
```

Reviewed By: amylittleyang

Differential Revision: D20969308

fbshipit-source-id: 06a02d23f4972a92beb18e1d052e27d8724539d0
2020-04-10 15:52:36 -07:00
bd4761123d Revert D20958928: [pytorch][PR] Port masked_select cuda from TH to ATen
Test Plan: revert-hammer

Differential Revision:
D20958928

Original commit changeset: 4704f5d2d271

fbshipit-source-id: 47eb440a74b7b1bd46b4a2aa1999e6de5aeb602b
2020-04-10 15:30:16 -07:00
86e8c49fae Revert D20523080: [pytorch] reduce memory footprint in fused conv QAT ops
Test Plan: revert-hammer

Differential Revision:
D20523080

Original commit changeset: 4a94047dee01

fbshipit-source-id: 66dce461c13dce794edb17fd7a32607d9c68a846
2020-04-10 15:23:54 -07:00
eddbee19a7 hardswish: add cuda kernels (#36350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36350

Adds CUDA kernels for hardswish in order to unblock use in training.
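
A small sketch of the training path this unblocks, assuming a CUDA device is available:

```
import torch
import torch.nn.functional as F

x = torch.randn(8, device="cuda", requires_grad=True)
y = F.hardswish(x)   # forward now runs a dedicated CUDA kernel
y.sum().backward()   # backward is likewise usable for training
```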

Test Plan:
added test coverage for forward pass
ran this script for various input sizes to test backward pass against a manual Hardswish module: https://gist.github.com/vkuzo/30e196b059427725817f2ee934ed0384

Imported from OSS

Differential Revision: D20955590

fbshipit-source-id: 635706fbf18af9a4205f2309f3314f2996df904d
2020-04-10 13:53:37 -07:00
7576cf8d00 [caffe2] Use cpuinfo in perfkernels to simplify build dependency (#36371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36371

This allows us to drop a circular dependency and remove unknown_symbols in the Buck build.

It'd be good to get rid of GetCpuId altogether in favor of cpuinfo, but that's not really blocking anything.

Reviewed By: malfet

Differential Revision: D20958000

fbshipit-source-id: ed17a2a90a51dc1adf9e634af56c85f0689f8f29
2020-04-10 13:26:34 -07:00
343f2c0925 Port masked_select cuda from TH to ATen (#35429)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33054

This PR does not directly depend on PR https://github.com/pytorch/pytorch/issues/33269 (the CPU counterpart), but whichever one of these two PRs gets merged last should remove `_th_masked_select` and  `_th_masked_select_bool` from `aten/src/ATen/Declarations.cwrap`.

Performance stats are here: https://github.com/pytorch/pytorch/issues/33054#issuecomment-591710014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35429

Differential Revision: D20958928

Pulled By: ngimel

fbshipit-source-id: 4704f5d2d271f3669cecd4f41d266ec1f67ec7f2
2020-04-10 13:21:17 -07:00
d27dccfdaf Open source the missing part of FakeFp16 ops (#36353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36353

ATT

Test Plan: buck build

Reviewed By: hyuen, amylittleyang

Differential Revision: D20953953

fbshipit-source-id: f6d2562dd3d123f0fcf4c912ed15053bf215d321
2020-04-10 13:16:13 -07:00
c029aaa25c Updating submodules
Summary:
GitHub commits:

0938cf4150
d600e5b0eb
15e3b9c3ad

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: b3ac607796d6e67f5260bb5627474be7f2d45f2c
2020-04-10 13:07:46 -07:00
f999d600d0 Fix the typo in operator name string (#36296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36296

When there's no overload name, the operator name string should be "name", instead of "name.".
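
The intended rule, as a hypothetical Python one-liner:

```
def full_operator_name(name: str, overload_name: str = "") -> str:
    # Only join with "." when an overload name exists, so e.g.
    # "aten::add" stays "aten::add" rather than becoming "aten::add.".
    return f"{name}.{overload_name}" if overload_name else name
```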

Test Plan: Imported from OSS

Differential Revision: D20966759

Pulled By: iseeyuan

fbshipit-source-id: b4b31923c7ec5cdca8ac919bd6a84ba51afb6cd1
2020-04-10 12:56:16 -07:00
82be7c755a [pytorch] reduce memory footprint in fused conv QAT ops (#35002)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35002

I was running into some memory issues once I enabled QAT, and I found some opportunities to use in-place operations. In particular, it looks like we can do the ReLUs in-place, and the bias addition also seems to work in-place. The multiplication operation right above the bias addition is *not* eligible because there's a bifurcation to produce conv_orig.

Reviewed By: jerryzh168

Differential Revision: D20523080

fbshipit-source-id: 4a94047dee0136f4014a328374896b28f561e41f
2020-04-10 12:39:35 -07:00
15c7486416 Canonicalize includes in c10, and add tests for it (#36299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36299

Test Plan: Imported from OSS

Differential Revision: D20943005

Pulled By: ezyang

fbshipit-source-id: 9dd0a58824bd0f1b5ad259942f92954ba1f63eae
2020-04-10 12:07:52 -07:00
42457e634d [TensorExpr] add support for Reduction Ops (#35866)
Summary:
Second attempt at the reduction frontend for the TensorExpr compiler. It has two APIs: a simple version for common reduction types, and a customizable Reducer frontend which allows specifying the initializer, the reduction interaction, and the body via lambdas.

Simple API looks like so:
```
Buffer b(BufHandle("b", {10}), kInt);
Tensor* c = Reduce("sum", {}, Sum(b), {{10, "m"}});
```

An example of specializing a Sum to do Matmul:
```
Buffer tA(BufHandle("tA", {M, K}), kFloat);
Buffer tB(BufHandle("tB", {K, N}), kFloat);
Sum matmul([&](ParameterList& v) {
  ExprHandle m = v[0];
  ExprHandle n = v[1];
  ExprHandle k = v[2];
  return tA(m, k) * tB(k, n);
});
Tensor* mm = Reduce("mm", {{M, "m"}, {N, "n"}}, matmul, {{K, "k"}});
```

A fully specialized Reduction:
```
VarHandle searchValue("searchValue", kInt);
Buffer b(BufHandle("b", {4, 10}), kInt);

Reducer anyEqSV(
    ExprHandle(0),
    [](ExprHandle a, ExprHandle b) {
      return CompareSelect::make(a, 1, 1, b, kEQ);
    },
    [&](ParameterList& v) {
      return CompareSelect::make(b.call(v), searchValue, kEQ);
    });

Tensor* any = Reduce("anyEqual", {{4, "i"}}, anyEqSV, {{10, "j"}});
```

 ---

Until lowering, Reductions are held in a compound form for easier optimization:
```
  VarHandle m("m", kInt);
  Buffer b(BufHandle("b", {2, 3, m}), kFloat);

  Tensor* c = Reduce("sum", {{2, "l"}, {3, "n"}}, Sum(b), {{m, "m"}});
  LoopNest loop({c});
  std::cout << *loop.root_stmt() << "\n";
```
```
for (int l = 0; l < 2; l++) {
  for (int n = 0; n < 3; n++) {
    for (int m = 0; m < m_1; m++) {
      sum[l, n] = ReduceOp(sum[l, n] = float(0);, (sum[l, n]) + (b[l, n, m]), {m});
    }
  }
}
```
```
  loop.prepareForCodegen();
  std::cout << *loop.root_stmt() << "\n";
```
```
for (int l = 0; l < 2; l++) {
  for (int n = 0; n < 3; n++) {
    sum[(0 + l * (1 * 3)) + n * 1] = float(0);
    for (int m = 0; m < m_1; m++) {
      sum[(0 + l * (1 * 3)) + n * 1] = (sum[(0 + l * (1 * 3)) + n * 1]) + (b[((0 + l * ((1 * m_1) * 3)) + n * (1 * m_1)) + m * 1]);
    }
  }
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35866

Differential Revision: D20965577

Pulled By: nickgg

fbshipit-source-id: afe506c90db794447180056417013bcaf0e2c049
2020-04-10 11:57:10 -07:00
5177906d67 [Shape Inference] Infer shape info for second input of elementwise ops (#36365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36365

Test Plan: unit test

Reviewed By: yinghai, ipiszy

Differential Revision: D20959518

fbshipit-source-id: bafbe4f87534398003a49ff5bd8f398b3a0f1473
2020-04-10 11:21:14 -07:00
4a98ba811c Enable c10 unboxing for ops with TensorList (#36330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36330

-
ghstack-source-id: 101908458

Test Plan: CI

Differential Revision: D20931186

fbshipit-source-id: 4947233260704f6962865957a8f1a7b38dd6cb0a
2020-04-10 11:04:14 -07:00
e574ff3511 Updating submodules
Summary:
GitHub commits:

02f45c752f
9e89ffb776
3db8b846ab

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: d5c9f07ec63a1eebf77a9e2b8cae11e2098016a1
2020-04-10 10:55:12 -07:00
d73ee763fc Fix the clang-format error caused in register prim ops change. (#36393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36393

Fix the clang-format CI error caused in #35426

Test Plan: Imported from OSS

Differential Revision: D20964854

Pulled By: iseeyuan

fbshipit-source-id: 97f2ba1e006cac0f33b223315263b0b84c24cb15
2020-04-10 10:11:27 -07:00
247f2df840 Fixed include file header guard. (#36329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36329

The same header guard was used in two different header files (not sure if this was intentional).

Test Plan: CI Tests

Reviewed By: jspark1105

Differential Revision: D20946512

fbshipit-source-id: dd0190943a8c90059d480f15c05f3bfcce956acd
2020-04-10 10:11:22 -07:00
79973a16ce Add missing TORCH_API annotation (#36391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36391

Without it I get

```
ImportError: /data/users/ezyang/pytorch-tmp/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch3jit18checkDoubleInRangeEd
```

when I build with DEBUG=1

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20964292

Pulled By: ezyang

fbshipit-source-id: b2569f5813c6490de51372e70029648a36891e7a
2020-04-10 10:09:24 -07:00
b0c90fad93 Re-enable test_avg_pool3d_nhwc (#36259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36259

Re-enable the test that was disabled in relation to #36129, which should be
fixed by the earlier PR #36103.

Test Plan: Imported from OSS

Differential Revision: D20933100

fbshipit-source-id: aca4e3b0b83a581fe58760b6730255b3176f41fc
2020-04-10 10:04:45 -07:00
1875c2e4bd Add torch.Tensor.as_subclass method. (#34369)
Summary:
This is according to pytorch/rfcs#3.
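
A short usage example:

```
import torch

class MyTensor(torch.Tensor):
    pass

t = torch.randn(3)
m = t.as_subclass(MyTensor)  # same storage, viewed as the subclass
assert isinstance(m, MyTensor)
```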
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34369

Differential Revision: D20963929

Pulled By: ezyang

fbshipit-source-id: e618af6fd36e1dfaeda617162314ad5840f55358
2020-04-10 09:16:35 -07:00
7c825bad10 [RELAND] Add __torch_function__ benchmarks (#36138)
Summary:
Re-land of https://github.com/pytorch/pytorch/issues/35530 and https://github.com/pytorch/pytorch/issues/34645
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36138

Differential Revision: D20893770

Pulled By: ezyang

fbshipit-source-id: 75ab688a086f5fb87412a853df5246c0c39704ca
2020-04-10 09:14:31 -07:00
3aeb2b1562 Returns float tensors for complex inputs to abs (#35871)
Summary:
Per title. A test is added to test_type_promotion for the behavior. This behavior is consistent with NumPy's.

For complex inputs to `abs` the result is cast to float after the computation since the computation of abs must be performed on the original complex tensor. While `std::abs` returns a float value when called on complex inputs, returning a FloatTensor directly would require additional loop instantiations in TensorIterator. This may be worthwhile to pursue in the future.
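
A sketch of the behavior this PR introduces, assuming the change is in effect (note the revert entry earlier in this log):

```
import torch

z = torch.tensor([3 + 4j], dtype=torch.complex64)
r = torch.abs(z)
print(r)        # tensor([5.])
print(r.dtype)  # torch.float32, matching NumPy
```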
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35871

Differential Revision: D20961711

Pulled By: mruberry

fbshipit-source-id: 232f62cf64caa4154eb2194969efa51d2082d842
2020-04-10 09:08:45 -07:00
817e4f9ef1 Correct a ValueError in dataloader to TypeError (#36244)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36244

Differential Revision: D20963949

Pulled By: ezyang

fbshipit-source-id: 8c6aa4831021788052269e7aa8282d11eba4e085
2020-04-10 09:03:58 -07:00
a91097bdfb Revert D20964368: Revert D20408831: [Lite Interpreter] Operator registration migrate from manual to selective build
Test Plan: revert-hammer

Differential Revision:
D20964368

Original commit changeset: f1874088a597

fbshipit-source-id: d9317ed97a98e2b04c190785b5564536b1096282
2020-04-10 08:19:36 -07:00
586481a6e2 Revert D20408831: [Lite Interpreter] Operator registration migrate from manual to selective build
Test Plan: revert-hammer

Differential Revision:
D20408831

Original commit changeset: ec75dd762c46

fbshipit-source-id: f1874088a5970dd220cc027d0020ab6223b9bd93
2020-04-10 08:03:38 -07:00
ee4cc96eee Vectorize in-place comparison operators (#35117)
Summary:
Benchmark: (Debian 10, Release build, gcc 8.3, no turbo, Intel(R)
Xeon(R) E-2136 CPU @ 3.30GHz)

```python
import timeit
for op in ('gt', 'lt', 'ge', 'le', 'eq', 'ne'):
    for dtype in ('torch.float', 'torch.double', 'torch.int16',
'torch.int32', 'torch.int64'):
        for n, t in [(10_000, 100000),
                    (100_000, 10000)]:
            print(f'a.{op}_(b), numel() == {n} for {t} times,
dtype={dtype}')
            print(timeit.timeit(f'a.{op}_(b)', setup=f'import torch; a =
torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1,
dtype={dtype})', number=t))
```

Before:

```
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.778998922000028
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6359690249992127
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.double
1.0801493119997758
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.9360321379990637
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7341018620008981
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6345281440007966
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7396387640001194
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6429641230006382
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7759611700003006
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6672059659995284
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7724312530008319
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6392585769990546
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.7917451840003196
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.6455550159989798
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.739991647998977
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6572993859990675
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7627949479992822
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6476544910001394
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7965036850000615
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6780715599998075
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7653547080008138
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6383065829995758
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.7895260240002244
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.6508346030004759
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7409299750015634
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6383492870008922
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7620547579990671
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6474270239996258
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.8070051169997896
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6712598600006459
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7627660060006747
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6406353189995571
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.double
1.0826010620003217
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.9391552950000914
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7427801039993938
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6365172640016681
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7679271510005492
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6453389289999905
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.788032889000533
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6708840760002204
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.float
1.078837263999958
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.9397531720005645
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.double
1.1031508050000411
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.9412319389994082
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7509566959997755
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.638570957000411
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7592877549996047
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6458840529994632
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7984061539991671
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6776346309998189
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7724407899986545
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6581534130000364
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.8303323249983805
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.6954390920000151
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.745512373998281
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6360954970004968
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7569978400006221
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6450422030011396
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7889118379989668
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6693385389989999
```

After:

```
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2444220920006046
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.2031730359994981
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.35491806199934217
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.3905606850003096
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.16665379499863775
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10095906300011848
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.21650469999985944
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.18737469400002738
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35481256200000644
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.36696120199849247
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.21976138800164335
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.20275393200063263
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.3695997209997586
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.39441510399956314
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.15657078300137073
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.0992998069996247
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.20425128799979575
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.20352934599941364
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35883567900054913
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.39059587599876977
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.21457727400047588
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.18836135499986995
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.35971907199927955
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.3688875009993353
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.1576009280015569
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.09524034199966991
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.2064543649994448
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.18726435600001423
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35351785300008487
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.3680737989998306
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2132134399998904
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.2140274829998816
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.36539215199991304
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.39128020300086064
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.15712150600120367
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10149904400168452
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.2103407699996751
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.2134442910009966
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35387034300038067
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.38917528399906587
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2190484450002259
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.2030815980015177
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.3710030169986567
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.36419657899932645
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.15986497499943653
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10145393699895067
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.21011781599918322
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.20121852699958254
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.36681504499938455
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.364472848999867
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2290963309988001
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.21674784300012107
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.3829616689999966
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.39437660300063726
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.1661020749997988
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10052955100036343
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.21827425599985872
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.21522501399886096
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.37058242300008715
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.39304063900090114
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35117

Differential Revision: D20721181

Pulled By: pbelevich

fbshipit-source-id: 4e38cc0f42393483db91b2dc53ffc507f91ed904
2020-04-10 06:50:07 -07:00
7fcf8b0a3b [Lite Interpreter] Operator registration migrate from manual to selective build (#35426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35426

Use selective build with the full set of operators (vs. manually register each used op with "_" prefix).

The lite interpreter relies on JIT operator dispatch. In the future we still need JIT operator dispatch to dispatch ops that are not registered in c10.
Currently the selective build covers c10/aten dispatch in BUCK. There is JIT selective code-gen in OSS, but it has not been ported to BUCK yet.
This diff also ports that selective code-gen to BUCK.
* The selected op list is passed to gen_jit_dispatch.py.
* The list passed to gen_jit_dispatch is the top-level ops (USED_PT_OPS) only, because the selective c10/aten dispatch already registered other ops that are called from the top-level ops.

ghstack-source-id: 101885215

(Note: this ignores all push blocking failures!)

Test Plan:
1. In Python, run torch.jit.export_opnames(scripted_M_mod)
2. Append the operator names into fbcode/caffe2/pt_ops.bzl and the BUCK target.
3. Run
```
buck run xplat/caffe2/fb/lite_predictor:lite_predictor_bi -- --model=/home/myuan/temp/bi_pytext_0315.bc --input_dims "1,4" --input_type int64 --pytext_len=4
```
Should provide expected results.
In addition, the size of the generated code for JIT registration, for example ```register_aten_ops_0.cpp```, should be significantly reduced (from ~250 KB to ~80 KB). The non-selected op registration schemas are still kept, but the registration functor is replaced by ```DUMMY_OPERATION```.

Reviewed By: ljk53

Differential Revision: D20408831

fbshipit-source-id: ec75dd762c4613aeda3b2094f5dad11804dc9492
2020-04-10 02:31:32 -07:00
9a4bc67f66 [caffe2/detectron2] fix Mask R-CNN caffe2 conversion on GPU (#36366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36366

fix issues introduced in D20528758

Reviewed By: linbinyu

Differential Revision: D20959367

fbshipit-source-id: 920055a6782b9c6729177f7101f2f9eb3e40ebf8
2020-04-10 01:12:13 -07:00
31dca07fa5 Updating submodules
Summary:
GitHub commits:

20b6cf14e2

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: f5a64d847512d25a138f666d25e03c58277327f0
2020-04-09 23:49:34 -07:00
37c1bd2946 Move FakeFP16 back to internal to remove dependency on MKL (#36297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36297

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/343

We moved FakeFP16 back to close source and kept `RoundToFloat16` function in "fbgemm/FbgemmConvert.h".

This is because FakeFP16 introduced dependency on MKL in the FBGEMM core. Also it doesn't seem to be needed for open source, as it is not used anywhere.

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D20937962

fbshipit-source-id: 9487a9fd2282b6df2f754c22bea36f2255a5c791
2020-04-09 23:03:04 -07:00
2ec6a30722 Bump produced file format version (#36085)
Summary:
This was left off of #35741, but the max supported file format change
has been landed for several weeks, so this should be fine to land.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36085

Pulled By: driazati

Reviewed By: eellison

Differential Revision: D20875051

fbshipit-source-id: c3b84c85d791cb6f286a2ed38ca5cd1219b332b2
2020-04-09 22:52:49 -07:00
aac36a89ff [model transform] tuple to arglist jit pass (#36093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36093

Unwrap any tuples (including NamedTuples) in the module forward
function's input list into an argument list.
1. Supports multiple tuple inputs, and traces their use through CallMethods and
TupleIndex
2. Does not unwrap inner use of other tuples that did not show up in the
original toplevel graph inputs

We work from the ScriptModule level instead of the Graph level because:
1. If the ScriptModule was previously called with the original set of inputs, the GraphExecutor caches the ExecutionPlan (specifically, ArgumentSpecCreator is derived from the Graph and type check the inputs passed in)
2. Since we are changing this graph's inputs, we clone the module and clear the GraphExecutor.

Since we work at the ScriptModule level, we cannot take advantage of JIT-level syntactic sugar like run_pass(), so I exposed this as a cpp extension. Let me know if there are other ideas about this.

Test Plan:
buck test caffe2/torch/fb/model_transform:signature_translation_test
Todo: Verify use in bento

Untranslated graph:
```
> graph(%self : __torch__.test_jit.SparseNNWrapper,
>       %inputs.1 : NamedTuple(dense : Tensor, sparse : Dict(int, Tensor))):
>   %2 : __torch__.test_jit.SparseNN = prim::GetAttr[name="main_module"](%self)
>   %4 : Tensor = prim::CallMethod[name="forward"](%2, %inputs.1) # /data/users/ansha/fbsource/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/test_jit.py:12141:23
>   return (%4)
```

Translated graph:
```
> graph(%self : __torch__.test_jit.___torch_mangle_1.SparseNNWrapper,
>       %inputs.1_0 : Tensor,
>       %inputs.1_1 : Dict(int, Tensor)):
>   %2 : __torch__.test_jit.___torch_mangle_2.SparseNN = prim::GetAttr[name="main_module"](%self)
>   %3 : Tensor = prim::CallMethod[name="forward"](%2, %inputs.1_0, %inputs.1_1) # /data/users/ansha/fbsource/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/test_jit.py:12141:23
>   return (%3)
```

Reviewed By: houseroad

Differential Revision: D20313673

fbshipit-source-id: fddd07c9537dc8b6f480a14d697bea10ecc74470
2020-04-09 22:05:43 -07:00
391a36a59c Updating submodules
Summary:
GitHub commits:

54dc44ba8a
726a62cb89
955756111e
66a95f0fac
f1d333089f

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 3585db200a5d84ba01b8cf90a3bd7bf003314345
2020-04-09 21:55:10 -07:00
891a533b24 Adding Conv1d to quantization default_mappings (#36352)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36352

Test Plan: Imported from OSS

Differential Revision: D20955781

Pulled By: z-a-f

fbshipit-source-id: 37fbcf329a6abcd9a367a73ad65ce543ed9ffe47
2020-04-09 21:41:13 -07:00
3d7c9abbf7 Refactor thread_reduce for better unrolling and vectorization in the future (#36014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36014

Benchmark on RTX2080Ti: 2.13ms vs 1.88ms
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark-refactor.ipynb

Test Plan: Imported from OSS

Differential Revision: D20927535

Pulled By: ngimel

fbshipit-source-id: b65b749b58cebe0751e4ec7e1cf359543c401580
2020-04-09 20:03:17 -07:00
d9227bb311 Target 4096 blocks instead of split to large grid for large reduction (#35997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35997

When the number of blocks is large enough, we already achieve
balanced SM allocation. But we should still keep the number of inputs
per thread large, because thread reduce is cheap.

Benchmark for Half on V100:
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb

On large tensor, it is: 1.37ms vs 1.25ms

Test Plan: Imported from OSS

Differential Revision: D20927533

Pulled By: ngimel

fbshipit-source-id: 40df52e439cc1c01cda66c6195b600f301c5e984
2020-04-09 20:00:53 -07:00
2f5b523cd0 Remove unnecessary whitespace in complex tensors (#36331)
Summary:
This PR addresses Issue https://github.com/pytorch/pytorch/issues/36279.
Previously, printing of complex tensors would sometimes yield extra spaces before the elements as shown below:
```
print(torch.tensor([[1 + 1.340j, 3 + 4j], [1.2 + 1.340j, 6.5 + 7j]], dtype=torch.complex64))
```
would yield
```
tensor([[(1.0000 + 1.3400j),
         (3.0000 + 4.0000j)],
        [(1.2000 + 1.3400j),
         (6.5000 + 7.0000j)]], dtype=torch.complex64)
```
This occurs primarily because when the max width for the element is being assigned, the formatter's max_width is calculated prior to truncating the float values. As a result, ```self.max_width``` would end up being much longer than the final length of the element string to be printed.

I address this by adding a boolean variable that checks whether a complex tensor contains only ints, and changing the control flow for calculating ```self.max_width``` accordingly.

Here are some sample outputs of both float and complex tensors:

```
tensor([[0., 0.],
        [0., 0.]], dtype=torch.float64)

tensor([[(0.+0.j), (0.+0.j)],
        [(0.+0.j), (0.+0.j)]], dtype=torch.complex64)

tensor([1.2000, 1.3400], dtype=torch.float64)

tensor([(1.2000+1.3400j)], dtype=torch.complex64)

tensor([[(1.0000+1.3400j), (3.0000+4.0000j)],
        [(1.2000+1.3400j), (6.5000+7.0000j)]], dtype=torch.complex64)

tensor([1.0000, 2.0000, 3.0000, 4.5000])

tensor([(1.+2.j)], dtype=torch.complex64)
```

cc ezyang anjali411 dylanbespalko
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36331

Differential Revision: D20955663

Pulled By: anjali411

fbshipit-source-id: c26a651eb5c9db6fcc315ad8d5c1bd9f4b4708f7
2020-04-09 19:35:05 -07:00
d916cf05d4 [quant][test] Split TestQuantizeScript to two TestCase (#36354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36354

TestQuantizeScript is split into
- TestQuantizeScriptJitPasses
- TestQuantizeScriptPTSQOps (post training static quantization ops)

Test Plan:
.

Imported from OSS

Differential Revision: D20956731

fbshipit-source-id: 860cd24ea3f49450126ce2d872894492bdc822d8
2020-04-09 19:18:22 -07:00
8cb1950805 [JIT] fix alias assertion (#36178)
Summary:
AnyType wasn't listed as a mutable type, so the assertion triggered (yay!). Also update the `isMutableTypeInternal(from) != isMutableTypeInternal` logic to be more encompassing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36178

Differential Revision: D20922356

Pulled By: eellison

fbshipit-source-id: 7060a62b18e98dc24b6004a66225c196aadb566e
2020-04-09 18:25:18 -07:00
48bf3eef1a [ONNX] disable size optimizations for onnx (#36243)
Summary:
Reviving this PR https://github.com/pytorch/pytorch/issues/35401 eellison. I believe after the profiled graph executor fix the test failures are handled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36243

Differential Revision: D20950623

Pulled By: eellison

fbshipit-source-id: 5fbee426d1a098d84d5938540d45ce00828299be
2020-04-09 18:17:42 -07:00
c5662dd5dc Base class for the quantized ConvTranspose (#35370)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35370

Test Plan: Imported from OSS

Differential Revision: D20641812

Pulled By: z-a-f

fbshipit-source-id: 42bb1ed96d6b6e0a5da6e693d02ff616c33d9ef6
2020-04-09 17:52:03 -07:00
7374a00bef [pt]Supported benchmarking pytorch jit self-contained models. (#35279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35279

Support benchmarking self-contained PyTorch JIT models.
* By specifying the flag `--no_inputs=True`, the binary supports benchmarking a self-contained TorchScript model (the model runs without inputs, i.e. `model.forward()`).
* This allows moving the data preparation part outside of this binary.

Reviewed By: kimishpatel

Differential Revision: D20585639

fbshipit-source-id: c28e50503534c90023c1430479d26f1c1ce740b1
2020-04-09 17:02:17 -07:00
f2bae8e869 [quant][fix] at::print for per channel affine quantized tensors (#36280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36280

Test Plan:
.

Imported from OSS

Differential Revision: D20948352

fbshipit-source-id: 92188806b9c129458ebb2cdc47599427e3b6e216
2020-04-09 16:27:00 -07:00
51456dc808 Updating submodules
Summary:
GitHub commits:

12efe7c0ad
af9585915d
07c56b2c42
4a29e9cbc3
1aaa4f7c22
832671379a
26862c2f23
819d357723
9f18c234d9
81b956b41c

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 643525fbe02fc99e1258047ae13f0fe3704e3709
2020-04-09 16:19:45 -07:00
358466f1da [quant] Move graph mode quantization tests to test_quantize_script.py (#36324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36324

Test Plan:
.

Imported from OSS

Differential Revision: D20948046

fbshipit-source-id: 2dd8f0c6fbe8fd84293420b97592dc586d25def9
2020-04-09 16:10:18 -07:00
14ce500a9b Appropriately handle exceptions in autograd engine. (#36019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36019

Once the autograd engine is finished with a GraphTask it would call
`markCompleted` on the Future. This could trigger callbacks on the Future that
could throw exceptions.

If one of the callbacks did throw an exception, we would call setErrorIfNeeded,
which would be a no-op since the Future is already marked as completed. This
effectively means we would be swallowing exceptions. To avoid this, we do
the following:

1) Rethrow the exception in `mark_graph_task_completed`.
2) In `setErrorIfNeeded`, log the error if we are ignoring it.
ghstack-source-id: 101607329

Test Plan: Verified appropriate logging.

Differential Revision: D20854806

fbshipit-source-id: 76bdf403cfd6d92f730ca1483ad5dba355f83e58
2020-04-09 15:18:03 -07:00
9662ef66b7 Fix torch.min docs (#36319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36319

On the way to resolving #35216.
This is a fix for just the master branch but once this goes in,
I'll send a cherry-pick to release/1.5

The problem is that we were not calling `format` on a string that had
templates (e.g., '{input}', '{dim}'). This change makes it so that we
call format on the entire docstring for `torch.min`.
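
The bug class, illustrated with a hypothetical docstring template:

```
template = r"""min(input, dim) -> Tensor

Returns the minimum value of each row of {input} in dimension {dim}."""

print(template)  # buggy: literal "{input}" and "{dim}" leak into the docs
print(template.format(input="``input``", dim="``dim``"))  # the fix
```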

Test Plan:
- The `torch.max` docs are OK:
https://pytorch.org/docs/master/torch.html#torch.max and don't need
changing.
- `torch.min` docs, before this change: see second screenshot in #35216.
- after this change: <Insert link here on github>

![image](https://user-images.githubusercontent.com/5652049/78921702-4e2acc00-7a63-11ea-9ea0-89636ff6fb0a.png)

Differential Revision: D20946702

Pulled By: zou3519

fbshipit-source-id: a1a28707e41136a9bb170c8a4191786cf037a0c2
2020-04-09 15:10:59 -07:00
1ffc2d9ace Updating submodules
Summary:
GitHub commits:

fe67bb7c0e
027c1644a7
8045a2a068
05953181a9
faeba96985
e860f8840a
a8a1113de5
959cdee731

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 652dbf7024ea374506fa5c46440a1ab9d630e28b
2020-04-09 15:03:00 -07:00
2de3e491a8 [RELAND] Add temporary impl_UNBOXED syntax sugar for unboxed-only defs. (#36223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36223

Previously #35714

There are a lot of unboxed only defs.  We're committed to removing
them at the end of the half but as I am about to do a lot of porting
to the new API, let's get them into a form where they're easy to
remove.  This is a new overload impl_UNBOXED that will pass
the function pointer straight to CppFunction::makeUnboxedOnly

I don't attempt to make the _UNBOXED API complete; in particular,
catchall declarations don't get this sugar (as there are very few
of them).

To get some coverage of _UNBOXED API for code analysis, I switched
one of our unboxed tests to be an impl rather than a def.  This
shouldn't materially affect coverage.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929259

Pulled By: ezyang

fbshipit-source-id: 72d2061b6c8a6afbcd392b47f53ade18de2f9184
2020-04-09 14:58:33 -07:00
ef07bb65e9 [RELAND] Add DispatchKey impl overload; remove use of torch::dispatch (#36222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36222

Reland of #35706, with fixes to code analyzer.

It is extremely common to define implementations of operators at a
specific dispatch key, so we add an overload to impl specifically for
this case.  I then delete most uses of torch::dispatch

dispatch_autograd call sites can't make use of this overload.  So
instead the new preferred way to specify something as autograd is to
pass kAutograd as the dispatch key (short form, analogous to kCPU/kCUDA
which we support today).

I flip flopped about whether or not kAutograd should have the type
DispatchKey or some other type (to help better encapsulate the
DispatchKey enum); this is more direct and I can't think of any
BC problems from this usage.

Some other reorganization I did:
- I renamed all of the worker functions in op_registration to have
  a leading underscore and made them private, just to make it more
  clear what the public versus private API were (the private API
  shouldn't be used by users because it doesn't come with && overloads)
  Note that this means I needed to adjust the regex in the
  code analyzer, because
- In a few places where I was touching lines already, I replaced
  full DispatchKey typed out enums with shorter kFoo names, similar
  to kAutograd but I didn't publish these globally.
- Code analyzer now prints a unified diff, and in the other order
  (because I tend to think of the diff as reporting how the /new/ result
  is different)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929256

Pulled By: ezyang

fbshipit-source-id: c69b803d2b3a1a8aff70e14da33d3adec5239f13
2020-04-09 14:56:55 -07:00
477f1c047c [TensorExpr] add simplication of constant branches to IR Simplifier (#36257)
Summary:
Adds handling of constant branches to the TensorExpr IR Simplifier. This covers both IfThenElse and Cond when the condition expression is a known constant (e.g. `IfThenElse(1, X, Y) => X`), or when both arms of the branch are the same (e.g. `IfThenElse(Y, X, X) => X`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36257

Differential Revision: D20947777

Pulled By: nickgg

fbshipit-source-id: 974379e42a6d65ce3e7178622afb62d36ad4e380
2020-04-09 14:45:13 -07:00
90c7db8ae3 caffe2/core/plan_executor: add cancellation of async nets on error + propagate exceptions via std::exception_ptr for stack traces (#31966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31966

This has three parts:
* When `--caffe2_handle_executor_threads_exceptions` is set and a parallel execution step throws an exception, execution can hang waiting for async nets to finish. This adds cancellation code to cancel any async nets.
* This makes the exceptions returned from parallel workers pass a std::exception_ptr so the stack trace can be recorded with folly::SmartExceptionTracer (see the sketch after this list).
* Define the Cancel method at the NetBase level to avoid pulling in the unsupported AsyncSchedulingNet for fbandroid.
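
A minimal generic sketch of the std::exception_ptr part (not the actual plan_executor code; names here are illustrative): the worker captures the in-flight exception and the caller rethrows it with the original type intact, which is what lets a tracer recover the stack.

```cpp
#include <exception>
#include <stdexcept>

std::exception_ptr captured;  // in the real code, one slot per worker

void worker_body() {
  try {
    throw std::runtime_error("async net failed");
  } catch (...) {
    captured = std::current_exception();  // keeps the original exception alive
  }
}

void join_and_rethrow() {
  worker_body();
  if (captured) {
    std::rethrow_exception(captured);  // original type (and trace) preserved
  }
}
```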

Test Plan:
Added unit tests for plan_executor

  buck test //caffe2/caffe2:caffe2_test_cpu
  buck test //caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100

Reviewed By: boryiingsu

Differential Revision: D19320177

fbshipit-source-id: d9939fcea1317751fa3de4172dfae7f781b71b75
2020-04-09 14:38:18 -07:00
88c22070fe Revert D20768930: add quantized layer norm implementation
Test Plan: revert-hammer

Differential Revision:
D20768930

Original commit changeset: ddf8727e9840

fbshipit-source-id: a190e1d1e42281eba627b0dbb6de1b3651cd5e97
2020-04-09 14:36:37 -07:00
d51ad40fe1 [quant][onnx] Mark upsample_nearest2d, sigmoid and reshape as no scale (#36325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36325

These ops do not compute a new output scale; they return the scale of the input tensor.

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py

Imported from OSS

Differential Revision: D20947338

fbshipit-source-id: 71fc15fce815972d23804ff7cf936da997e71dc0
2020-04-09 14:31:55 -07:00
e551bfc8de New CUDA Fuser code lowering refactor (#36199)
Summary:
This PR completely refactors the code lowering process from our IR to CUDA. Before, we had one giant step that went from a relatively high-level IR straight to CUDA; now we first lower into concepts like ForLoop, IfThenElse, TensorIndex, and Allocate. This lowering will allow us to do more complex code lowering like reductions and unrolling. Unrolling will quickly follow this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36199

Reviewed By: dzhulgakov

Differential Revision: D20925220

Pulled By: soumith

fbshipit-source-id: 8f621c694c68a1aad8653e625d7287fe2d8b35dc
2020-04-09 14:27:05 -07:00
f0ea6862ba Support for pruning delays in Adagrad Optimizer (#34527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34527

Adding support for prune_delays and prune_ratios in the Adagrad optimizer.

Test Plan:
Tested via unit tests in masked_adagrad_optimizer_test. Added a unit test for the prune_delay versions of MaskedAdagrad.

buck build caffe2/caffe2/fb/optimizers:masked_adagrad_optimizer_test; buck-out/gen/caffe2/caffe2/fb/optimizers/masked_adagrad_optimizer_test#binary.par

buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- 'test_pruning'

All Dper tests passed https://our.intern.facebook.com/intern/testinfra/testrun/7599824380741217

Reviewed By: chocjy

Differential Revision: D20313419

fbshipit-source-id: 5c2c8d4e0fc2ec538bcd6f145c6b87a2381f90f3
2020-04-09 12:59:23 -07:00
376542c83d caffe2: preserve python exception type from PythonOp (#36267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36267

This makes PythonOp throw the original python exception instead of wrapping it in a c10::Error type. This allows throwing exceptions from Python and preserving the type when they're caught again in Python. This is important for structured logging and handling non-retryable error types.

Test Plan: buck test caffe2/caffe2/python:python_op_test

Reviewed By: wenqicaofb

Differential Revision: D20928098

fbshipit-source-id: 001747f022c657b420f8450b84d64f4d57f6cdf6
2020-04-09 12:43:24 -07:00
8493383e94 remove some code part never been called (#35033)
Summary:
As titled. Found that `THBlas_(swap)` was never used, so I removed it from the repo. Please help review the patch; any suggestions are welcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35033

Differential Revision: D20918998

Pulled By: albanD

fbshipit-source-id: 93af8429231421185db0ccdfdd44e349a8f68c67
2020-04-09 12:36:52 -07:00
264da24c9e Fixing RPC Shutdown and Thread Joining (#36239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36239

ProcessGroupAgent and ThriftAgent threads were joined at shutdown, but RpcAgent threads were joined by the destructor. This PR joins all threads at shutdown by using a pattern similar to `start` in RPC.

The derived classes implement a `shutdownImpl` method that cleans up backend-specific state. RpcAgent implements `shutdown`, which cleans up generic state and calls the underlying `shutdownImpl`. The atomic `running` flag is now set and unset by RpcAgent so backends do not need to mutate it.
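
A minimal sketch of the pattern described above (class and member names beyond those in the summary are assumed):

```cpp
#include <atomic>

class RpcAgentSketch {
 public:
  void shutdown() {
    running_.store(false);  // generic state is handled by the base class
    shutdownImpl();         // backend-specific cleanup (join threads, etc.)
  }
  virtual ~RpcAgentSketch() = default;

 protected:
  virtual void shutdownImpl() = 0;  // implemented by each backend agent
  std::atomic<bool> running_{true};
};
```
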
ghstack-source-id: 101820415

Test Plan: Ensured this works with `test_duplicate_name` (in which RpcAgent is constructed but PGA is not), and selected `rpc_spawn` and `dist_autograd_spawn` tests with TSAN. Checking Build Bot and CI as well, and continuing to test more with TSAN on devserver (currently running into memory issues).

Reviewed By: jjlilley

Differential Revision: D20902666

fbshipit-source-id: 5dbb5fc92ba66f75614c050bb10b10810770ab12
2020-04-09 12:32:00 -07:00
9497b21e63 Grad input padding support for dilation argument (#33872)
Summary:
Fix https://github.com/pytorch/pytorch/issues/16012

It replaces https://github.com/pytorch/pytorch/pull/20684 that has gone stale and simply adds tests on top of it.
These calls used to crash, they now work and return the same value as the backward using the autograd engine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33872

Differential Revision: D20148360

Pulled By: albanD

fbshipit-source-id: 1113f1a25be238570fa8900fc1be658b61a47802
2020-04-09 11:09:55 -07:00
e311e53abe Revert D18672405: Revert D18672405: Use codegen'ed unboxing wrappers (#36010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36010

-
ghstack-source-id: 101645932

Test Plan: CI

Differential Revision: D20577433

fbshipit-source-id: f80fb62de68d1d11ea05dd1f694b36356a06189b
2020-04-09 10:45:28 -07:00
866227cfb3 [pt][quant] Add vector path to copy kernel for quantized data types (#36189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36189

We only had a scalar path for the copy kernel for quantized data types. This diff adds a vector path. It should improve all the ops where copy is used. This results in 10x better performance for mul_scalar in one of the benchmarked models.
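
A generic sketch of the idea (not the actual kernel, which goes through the quantized Vec256 types; the function name is hypothetical): move quantized 8-bit data in 256-bit chunks with a scalar tail.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void copy_q8(uint8_t* dst, const uint8_t* src, size_t n) {
  size_t i = 0;
  // Vector path: 32 bytes per iteration.
  for (; i + 32 <= n; i += 32) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), v);
  }
  // Scalar tail for the remaining elements.
  for (; i < n; ++i) {
    dst[i] = src[i];
  }
}
```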

### Before:
```
 -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.16%            171.287us        0.16%            171.287us        171.287us        1
quantized::conv2d          56.65%           58.830ms         56.65%           58.830ms         387.040us        152
quantized::add_scalar      6.02%            6.256ms          6.02%            6.256ms          67.270us         93
quantized::relu6           2.04%            2.121ms          2.04%            2.121ms          22.808us         93
quantized::mul_scalar      19.33%           20.076ms         19.33%           20.076ms         215.876us        93
quantized::mul             13.79%           14.320ms         13.79%           14.320ms         124.520us        115
quantized::add             1.17%            1.215ms          1.17%            1.215ms          43.388us         28
adaptive_avg_pool2d        0.04%            41.684us         0.64%            661.083us        28.743us         23
_adaptive_avg_pool2d       0.60%            619.399us        0.60%            619.399us        26.930us         23
sigmoid                    0.17%            180.745us        0.17%            180.745us        8.216us          22
dropout                    0.00%            1.798us          0.00%            1.798us          1.798us          1
view                       0.01%            8.529us          0.01%            8.529us          8.529us          1
dequantize                 0.01%            7.481us          0.01%            7.481us          7.481us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 103.849ms
```

### After:
```
 -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.23%            193.581us        0.23%            193.581us        193.581us        1
quantized::conv2d          68.66%           58.702ms         68.66%           58.702ms         386.197us        152
quantized::add_scalar      7.11%            6.082ms          7.11%            6.082ms          65.401us         93
quantized::relu6           2.40%            2.056ms          2.40%            2.056ms          22.104us         93
quantized::mul_scalar      2.34%            2.001ms          2.34%            2.001ms          21.513us         93
quantized::mul             16.85%           14.410ms         16.85%           14.410ms         125.308us        115
quantized::add             1.34%            1.149ms          1.34%            1.149ms          41.033us         28
adaptive_avg_pool2d        0.05%            46.415us         0.78%            667.620us        29.027us         23
_adaptive_avg_pool2d       0.73%            621.205us        0.73%            621.205us        27.009us         23
sigmoid                    0.25%            215.650us        0.25%            215.650us        9.802us          22
dropout                    0.00%            2.503us          0.00%            2.503us          2.503us          1
view                       0.01%            11.608us         0.01%            11.608us         11.608us         1
dequantize                 0.01%            9.221us          0.01%            9.221us          9.221us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 85.500ms
```

Test Plan: buck test //caffe2/test:quantization -- 'test_qtensor_copy'  --print-passing-details

Reviewed By: jspark1105

Differential Revision: D20906956

fbshipit-source-id: d538b8dc0d031ce61cb1b0af14a1c012976d75b1
2020-04-09 10:43:18 -07:00
1443db8dc3 [TensorExpr] fix bug in IRSimplifier when multiplying by 0 (#36287)
Summary:
In the IR Simplifier we were not treating multiplication by zero specially, which meant some constant expressions were stored in forms that were not recognized as constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36287

Differential Revision: D20937497

Pulled By: nickgg

fbshipit-source-id: 528e430313ea048524d7a4a0256eef4a0297438b
2020-04-09 09:55:16 -07:00
5061ef63f4 Revert "Revert D20885968: [pytorch][PR] Enable backtrace with MSVC" (#36205)
Summary:
This reverts commit 8afa001d898914a48d6b9e3d944a99607d2819c1 and makes a few improvements, including the following items.
1. return `std::string` for `get_module_base_name`
2. eliminate `module should always be true` warning
3. do `SymInitialize` and `SymCleanup` once to save time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36205

Reviewed By: malfet

Differential Revision: D20919672

Pulled By: ezyang

fbshipit-source-id: 0063a478779feb106459af48063485ef676008a5
2020-04-09 09:48:41 -07:00
423b01431b make vendor match with this implementation (#36302)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36302

The mismatch was between using 1/sqrt(x) and rsqrt(x).

Test Plan:
tested with the seed from testwarden 1586230820
tested without the seed

Differential Revision: D20939672

fbshipit-source-id: c7be030c4ae42e78765edda2ce1ad2e213a46030
2020-04-09 09:42:34 -07:00
f813e7184e add quantized layer norm implementation (#35329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35329

Adds a quantized implementation of LayerNorm for server.

A future PR will add the Python wrapper.

Test Plan:
numerics match the floating point implementation

benchmarks by input size:
v1 (mean+var non-vectorized): https://gist.github.com/vkuzo/f6d72c04742608112f4c2e612c74bd13
v2 (mean+var vectorized in float): https://gist.github.com/vkuzo/4dd95657c5b5f3654e0965db00eff8d2
v3 (mean+var vectorized in int, current): https://gist.github.com/vkuzo/57a75f75629da9f23b64b38ca0e3d34b

Imported from OSS

Differential Revision: D20768930

fbshipit-source-id: ddf8727e9840c65ead3b890220af0638c5637028
2020-04-09 09:11:41 -07:00
23e5f6a7be Add avx2 integer horizontal sum and sum of squares to vec256 qint types (#35693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35693

Adds utility functions to quantized int types of vec256 to calculate
horizontal sums and sums of squares using avx2 intrinsics.

This is useful for quantized implementations of various normalization
layers (LayerNorm, GroupNorm, InstanceNorm), where we need to calculate
the mean and variance of a layer of quantized ints.
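
For reference, a minimal standalone sketch of an AVX2 horizontal sum over eight 32-bit lanes (the function name is hypothetical; the actual vec256 utilities also handle widening from the 8-bit quantized types):

```cpp
#include <immintrin.h>
#include <cstdint>

// Reduce the eight 32-bit lanes of an AVX2 register to a single scalar.
int32_t hsum_epi32(__m256i v) {
  __m128i lo = _mm256_castsi256_si128(v);       // lanes 0..3
  __m128i hi = _mm256_extracti128_si256(v, 1);  // lanes 4..7
  __m128i s  = _mm_add_epi32(lo, hi);           // add the two halves
  s = _mm_hadd_epi32(s, s);                     // pairwise horizontal add
  s = _mm_hadd_epi32(s, s);                     // final horizontal add
  return _mm_cvtsi128_si32(s);                  // extract lane 0
}
```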

Test Plan:
Adhoc c++ tester for the correctness of the avx2 functions:
https://gist.github.com/vkuzo/0380f450793cd5c05abbeacb6d3883ae
Run with:
```
gcc -lstdc++ -mavx2 -lm -ldl -o main main.cpp && ./main
```

The integration bits and performance will be tested in the next PR in the stack
where we will hook quantized Layernorm to use this.

Imported from OSS

Differential Revision: D20768804

fbshipit-source-id: 4720dd358dde0dabbab8e1a33a67be55925d98f9
2020-04-09 09:10:10 -07:00
126d00c8dd [pytorch] move force schema registration output into a separate file (#36284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36284

ATen/gen.py's `force_schema_registration` flag was added in #34622 to unblock c10 boxing for custom build,
as the full-JIT frontend expects certain op schemas to always be registered (the actual op implementation can be
skipped if it's not used).
The flag didn't work together with the `per_op_registration` flag, which was added for FB BUCK selective build.
This PR makes it work with the `per_op_registration` flag by moving schema registrations to a separate file.

This way, internal full-JIT can include the new source file while lite-JIT can ignore it. OSS custom build
should still work as before.

Updated table of codegen flags and 5 build configurations that are related to mobile:
```
+--------------------------------------+-----------------------------------------------------------------------------+--------------------------------------------+
|                                      |                              Open Source                                    |                  FB BUCK                   |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
|                                      |    Default Build    | Custom Build w/ Stat-Disp | Custom Build w/ Dyna-Disp |   Full-JIT    |         Lite-JIT           |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| Dispatch Type                        | Static              | Static                    | Dynamic                   | Dynamic       | Dynamic                    |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| ATen/gen.py                          |                     |                           |                           |               |                            |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| --op_registration_whitelist          | unset               | used root ops             | closure(used root ops)    | unset         | closure(possibly used ops) |
| --backend_whitelist                  | CPU Q-CPU           | CPU Q-CPU                 | CPU Q-CPU                 | CPU Q-CPU     | CPU Q-CPU                  |
| --per_op_registration                | false               | false                     | false                     | true          | true                       |
| --force_schema_registration          | false               | true                      | true                      | true          | true (output unused)       |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| tools/setup_helpers/generate_code.py |                     |                           |                           |               |                            |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| --disable-autograd                   | true                | true                      | true                      | false         | WIP                        |
| --selected-op-list-path              | file(used root ops) | file(used root ops)       | file(used root ops)       | unset         | unset                      |
| --selected-op-list (WIP)             | unset               | unset                     | unset                     | unset         | used root ops              |
| --force_schema_registration (WIP)    | false               | true                      | true                      | true          | false                      |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
```
ghstack-source-id: 101840182

Test Plan:
- check OSS CI;
- patch D20577433 on top of this change to make sure test passes on it;
- check mobile build size bot;

Differential Revision: D20932484

fbshipit-source-id: 5028a6f90f2c7ee66fc70c562643b536a32b4d33
2020-04-09 09:00:52 -07:00
fdf7a833e7 Address printing inconsistency between float and complex tensors (#35841)
Summary:
See issue [https://github.com/pytorch/pytorch/issues/33494 Complex number printing inconsistent with float](https://github.com/pytorch/pytorch/issues/33494).

The change introduces an optional argument to the Formatter's `format` function to discern whether a tensor is a float tensor or not. This way, there is consistency between float tensors and complex tensors, so that complex tensors print in the same manner as float tensors:

- Only a decimal point and no zeros for integer values.
- Trailing zeros only if the value is truly a float.
- White space introduced to fill the gap so that +/- symbols and commas align.

Here are some example outputs.

```
print(torch.zeros((2,2), dtype=torch.float64))
```
yields
```
tensor([[0., 0.],
        [0., 0.]], dtype=torch.float64)
```

```
print(torch.zeros((2,2), dtype=torch.complex64))
```
previously yielded
```
tensor([[(0.0000 + 0.0000j), (0.0000 + 0.0000j)],
        [(0.0000 + 0.0000j), (0.0000 + 0.0000j)]], dtype=torch.complex64)
```
and now yields
```
tensor([[(0 + 0.j), (0 + 0.j)],
        [(0 + 0.j), (0 + 0.j)]], dtype=torch.complex64)
```
This new print version is more consistent with float tensor's pretty print.

The following example mixes integer and decimals:
```
print(torch.tensor([[1 + 1.340j, 3 + 4j], [1.2 + 1.340j, 6.5 + 7j]], dtype=torch.complex64))
```
This yields:
```
tensor([[                     (1.0000 + 1.3400j),
                              (3.0000 + 4.0000j)],
        [                     (1.2000 + 1.3400j),
                              (6.5000 + 7.0000j)]], dtype=torch.complex64)
```

The following example
```
torch.tensor([1,2,3,4.5])
```
yields
```
tensor([1.0000, 2.0000, 3.0000, 4.5000])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35841

Differential Revision: D20893848

Pulled By: anjali411

fbshipit-source-id: f84c533b8957a1563602439c07e60efbc79691bc
2020-04-09 08:54:25 -07:00
2b30e7fe11 Move inplace view tests to generic testing framework (#36281)
Summary:
So that all these tests run on CUDA as well.
This PR is preparation for https://github.com/pytorch/pytorch/pull/36073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36281

Differential Revision: D20931467

Pulled By: ailzhang

fbshipit-source-id: e70c2c1981d9557c4b7ed5e0bd85345e298bf63c
2020-04-09 08:48:01 -07:00
ddf5755ff8 Fix DDP error checking for unused parameters (#36054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36054

Test Plan: Imported from OSS

Differential Revision: D20865498

Pulled By: mrshenli

fbshipit-source-id: 6dbb7b9b1d1ace3a8a619431330c260bfc43cefd
2020-04-09 08:06:25 -07:00
7403545518 Fix exception message of torch.optim.AdamW. (#36088)
Summary:
PyTorch does not implement `SparseAdamW`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36088

Differential Revision: D20932357

Pulled By: gchanan

fbshipit-source-id: 49e5b72c34ff8ce0deb6b3807662b8b7d67d959f
2020-04-09 08:02:10 -07:00
075b732f26 doc fix for KLDivLoss (#36137)
Summary:
Fixes doc for KLDivLoss as per [this comment](https://github.com/pytorch/pytorch/pull/34586#discussion_r404442741).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36137

Differential Revision: D20932395

Pulled By: gchanan

fbshipit-source-id: ecc395e6bc689fbf758e2cdca946049de8963856
2020-04-09 07:54:56 -07:00
5bbcddae3b Add at::Generator to IValue (#36231)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36231

Differential Revision: D20923443

Pulled By: pbelevich

fbshipit-source-id: 0cc00a5c1f7bb2fb5525416291c4dd870d23a881
2020-04-09 06:57:32 -07:00
ea8e347135 Replace std::shared_ptr with c10::intrusive_ptr in at::Generator (#36230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36230

To make `at::Generator` compatible with `IValue` this PR replaces `std::shared_ptr<c10::GeneratorImpl>` with `c10::intrusive_ptr<c10::GeneratorImpl>`
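
A minimal sketch of the c10::intrusive_ptr pattern this relies on (the struct here is hypothetical, not the real GeneratorImpl): the refcount lives inside the object via intrusive_ptr_target rather than in a separate control block, which is what lets IValue hold it compactly.

```cpp
#include <c10/util/intrusive_ptr.h>
#include <cstdint>

// Hypothetical stand-in for a refcounted impl type.
struct GeneratorImplSketch : c10::intrusive_ptr_target {
  uint64_t seed = 0;
};

void example() {
  auto gen = c10::make_intrusive<GeneratorImplSketch>();
  gen->seed = 42;
  // Copying bumps the embedded refcount, like shared_ptr but intrusive.
  c10::intrusive_ptr<GeneratorImplSketch> alias = gen;
}
```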

Differential Revision: D20923377

Pulled By: pbelevich

fbshipit-source-id: 3cb4214900023d863e5f2fe4ea63ec8aeb30936a
2020-04-09 06:55:54 -07:00
62f9312abd Revert D20783298: Fix naming of "strides" method in TensorType
Test Plan: revert-hammer

Differential Revision:
D20783298

Original commit changeset: 8fcc146284af

fbshipit-source-id: 30e3cb6d7a30d82048534d4d2e794b7e08ae01bb
2020-04-09 04:24:43 -07:00
7487b2a184 [caffe2][debuggability] add length checks to MergeMultiScalarFeatureTensors (#36248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36248

Add basic index/length checks to MergeMultiScalarFeatureTensors to avoid segfaults.

But I don't really understand this op: what would cause this mismatch (see test plan)? I would like to add it to the assertion description.

Reviewed By: houseroad

Differential Revision: D20912048

fbshipit-source-id: 29ef8c4bd261a48d64cbef6aa4f0306d7f058e71
2020-04-09 02:57:21 -07:00
5f25e98fc7 Use _sparse_coo_tensor_unsafe to shallow copy sparse tensors in accumulate_grad (#36292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36292

As reported in https://github.com/pytorch/pytorch/issues/36120,
sparse_coo_tensor has some expensive checks and we were using that to shallow
copy a sparse tensor in AccumulateGrad. This can be avoided by using
_sparse_tensor_coo_unsafe since we're just reusing the indices and values from
a valid sparse tensor to shallow copy it.

Using the benchmark code mentioned in
https://github.com/pytorch/pytorch/issues/36120, these are the results:

1) 65.1 ms on master with this PR.
2) 127.5 ms for PyTorch 1.4
3) 916.5 ms on master without this patch.
ghstack-source-id: 101817209

Test Plan: waitforbuildbot

Differential Revision: D20935573

fbshipit-source-id: 4661bc779c06b47b5eb677e3fd4e192d1e3cba77
2020-04-09 00:20:32 -07:00
f59e646faa [rpc] Allow profiling in RPC to work with torchscript function invocations (#36275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36275

Calling a TorchScript function from within RPC was added after initial
support for the profiler with RPC; hence, we were not correctly recording
TorchScript functions invoked under RPC. This diff passes the `RecordFunction` to
the `_invoke_torchscript..` calls, similar to what is done for builtins and UDFs.

However, this is only a temporary solution. We will be removing the use of
`RecordFunction` as a standalone in the RPC code in
https://github.com/pytorch/pytorch/pull/35055. This diff is to unblock
recording of torchscript functions in the meantime.
ghstack-source-id: 101800134

Test Plan:
Added tests for calling a script function with builtin, sync, and
async. The output looks like below:

```
----------------------------------------------------------------------------------------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                                                                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
----------------------------------------------------------------------------------------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
rpc_sync#__torch__.torch.testing._internal.distributed.rpc.rpc_test.my_script_func(worker1 -> worker2)      99.92%            1.056s           99.92%           1.056s           1.056s           1
select                                                                                                      0.04%             383.661us        0.04%            383.661us        95.915us         4
fill_                                                                                                       0.02%             210.966us        0.02%            210.966us        52.741us         4
to                                                                                                          0.00%             26.276us         0.00%            26.276us         26.276us         1
empty                                                                                                       0.02%             159.802us        0.02%            159.802us        79.901us         2
set_                                                                                                        0.01%             93.818us         0.01%            93.818us         93.818us         1
----------------------------------------------------------------------------------------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 1.057s
```

Note that we use `torch.jit._qualified_name` to get the name of the script fn.

Differential Revision: D20930453

fbshipit-source-id: c6d940aa44fcd9dd8a1a29c156aa19e0d8428d60
2020-04-08 23:58:36 -07:00
3d199aab08 Updating submodules
Summary:
GitHub commits:

83fc90b3df
1910f8c0e3

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 1d08aaea71b4896e8d214a00abae03944901e748
2020-04-08 21:23:38 -07:00
dd36f8c21b [FBGEMM] Open sourcing fbgemm_fp16 ops (#36212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36212

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/342

So that we can let vendors use this as a reference for emulated fp16 ops.

Will modify the dependent TARGETS and CMakefiles.

Test Plan:
```
buck test deeplearning/fbgemm:
```

Reviewed By: hyuen

Differential Revision: D20911460

fbshipit-source-id: bb8a43e13591f295727fe1ecc74eca4ca85ab5b8
2020-04-08 20:44:18 -07:00
291c910e85 [future] Re-land some safe portions of the future change. (#36254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254

These future-use changes all landed yesterday as part of the future
refactoring and were quickly reverted due to an observed OOM. They are now
being relanded, as they have since been verified to be benign.
ghstack-source-id: 101776613

Test Plan:
buck test mode/dev-nosan caffe2/test/...
   not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod

Differential Revision: D20924010

fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
2020-04-08 20:05:33 -07:00
2458f6c63e Move all nccl from torch_python to torch_cuda (#36193)
Summary:
Because `torch_python` is supposed to be a thin wrapper around `torch`.

In this PR, all invocations of functions from the nccl library are moved from python_nccl.cpp (which is part of torch_python) to nccl.cpp (which is part of torch_cuda).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36193

Test Plan: CI

Differential Revision: D20930047

Pulled By: malfet

fbshipit-source-id: 7f278610077df6ac5dc3471c1a1b5d51e653ef9c
2020-04-08 18:01:47 -07:00
34a10238d5 fix is_float_scale_factor warning (c++) (#35601)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35601

Differential Revision: D20925642

Pulled By: yf225

fbshipit-source-id: a4e1f953efce04b3f399a8e526fb6c055cc2971c
2020-04-08 17:52:09 -07:00
3a8838840b Add comparison operators to Vec256<BFloat16> (#36106)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36106

Test Plan: Imported from OSS

Differential Revision: D20927638

Pulled By: ngimel

fbshipit-source-id: 6831747baab1af9d794011e2c7ae0291828dea2b
2020-04-08 17:03:36 -07:00
0bc17ddaa9 Use templates instead of macro when defining Vec256<BFloat16> bin operators (#35844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35844

Also, bitwise operators can operate on the underlying __m256i
representation directly instead of making expensive conversions to
float16.
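
A hedged, self-contained sketch of the two ideas (names are hypothetical, not the actual Vec256 code): one function template replaces a family of near-identical macro expansions, and bitwise ops act on the raw __m256i bits with no conversion round-trip.

```cpp
#include <immintrin.h>

// One template instead of N near-identical macro expansions.
template <typename Op>
__m256i binary_bits_op(__m256i a, __m256i b, Op op) {
  return op(a, b);
}

__m256i bf16_and(__m256i a, __m256i b) {
  // Bitwise AND is exact on the raw 16-bit lanes; no widening needed.
  return binary_bits_op(a, b,
      [](__m256i x, __m256i y) { return _mm256_and_si256(x, y); });
}
```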

Test Plan: Imported from OSS

Differential Revision: D20927639

Pulled By: ngimel

fbshipit-source-id: 148c503df090580c8504f0df8d6ed2648d614120
2020-04-08 17:02:22 -07:00
0f34d648c8 Fix signed-unsigned warnings (RELAND) (#36224)
Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/36196
Before the fix, bazel spews the following multi-line warning for every single caffe2 operator:
```
In file included from ./c10/util/logging_is_google_glog.h:50,
                 from ./c10/util/Logging.h:26,
                 from ./caffe2/core/logging.h:2,
                 from ./caffe2/core/blob.h:13,
                 from ./caffe2/core/operator.h:18,
                 from ./caffe2/sgd/adadelta_op.h:1,
                 from caffe2/sgd/adadelta_op.cc:1:
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h: In instantiation of 'std::string* google::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::__cxx11::basic_string<char>]':
./caffe2/core/operator.h:192:5:   required from 'const T& caffe2::OperatorBase::Input(int, caffe2::DeviceType) [with T = caffe2::Tensor; caffe2::DeviceType = c10::DeviceType]'
./caffe2/core/operator.h:890:48:   required from 'const caffe2::Tensor& caffe2::Operator<Context>::Input(int, caffe2::DeviceType) [with Context = caffe2::CPUContext; caffe2::DeviceType = c10::DeviceType]'
./caffe2/sgd/adadelta_op.h:87:5:   required from 'bool caffe2::SparseAdadeltaOp<Context>::RunOnDevice() [with Context = caffe2::CPUContext]'
./caffe2/sgd/adadelta_op.h:85:8:   required from here
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:32: warning: comparison of integer expressions of different signedness: 'const int' and 'const long unsigned int' [-Wsign-compare]
  722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
      |                                ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:148:53: note: in definition of macro 'GOOGLE_PREDICT_TRUE'
  148 | #define GOOGLE_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
      |                                                     ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:1: note: in expansion of macro 'DEFINE_CHECK_OP_IMPL'
  722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
      | ^~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36224

Test Plan: CI

Differential Revision: D20919506

Pulled By: malfet

fbshipit-source-id: b8b4b7c62dcbc109b30165b19635a6ef30033e73
2020-04-08 16:29:27 -07:00
16980e455f Fix naming of "strides" method in TensorType (#35170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35170

Looks like this was renamed by accident in 0cbd7fa46f2

Test Plan:
Unit test.

Imported from OSS

Differential Revision: D20783298

fbshipit-source-id: 8fcc146284af022ec1afe8d651baf6721b190ad3
2020-04-08 15:59:28 -07:00
6972c27d94 [quant] Enable fusion for conv modules with bias (#36173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36173

Previously we were ignoring the conv bias during training if it existed
This PR adds the bias from the conv op during the conv+bn fusion process

Test Plan:
python test/quantization/test_quantization.py

Imported from OSS

Differential Revision: D20921613

fbshipit-source-id: eacb2ccf9107f413ac4ef23163ba914af9b90924
2020-04-08 15:53:32 -07:00
caa45c8e33 [TensorExpr] fix warnings (#36167)
Summary:
Fix a bunch of minor warnings in jit/tensorexpr, mostly unused variables and signed/unsigned comparison warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36167

Differential Revision: D20905081

Pulled By: nickgg

fbshipit-source-id: 16fe605a86f08596f64e74e9337c59a2581a4d5a
2020-04-08 15:42:29 -07:00
76c7652cc5 Add distributed data parallel benchmark tool (#35198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35198

The need for this tool was motivated by #28883. In the past, we have
done ad-hoc benchmarking, but it's time for something more structured.

It would be nice to add more model architectures so that we can get a
full picture of the performance impact of a code change simply by
running this suite a few times.

Test Plan: Imported from OSS

Differential Revision: D20591296

Pulled By: mrshenli

fbshipit-source-id: ee66ce0ebca02086453b02df0a94fde27ab4be49
2020-04-08 15:07:03 -07:00
4f3af09162 [JIT] Incremental updates to Alias Db in Mutation Remover pass (#35421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35421

This PR makes it so that we don't have to rebuild the entire alias db each time we remove a node in alias analysis.

Test Plan: Imported from OSS

Differential Revision: D20922470

Pulled By: eellison

fbshipit-source-id: 9f43ed6dc743bf8a6b84a4aa38cff7059d46741d
2020-04-08 15:00:44 -07:00
4db87f4f97 [JIT] Allow mutated values as functional graph inputs (#33297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33297

Allowing mutated values as inputs but not outputs has the effect of buffering up all mutated values as inputs to the graph. Just as we allow values which escape scope as graph inputs but not graph outputs, we should also allow values that get mutated. In both cases, the contract is that the functional graph cannot write to graph inputs.

Without this patch, if there is a single write to the Tensor wildcard set it would disable all optimization.

Test Plan: Imported from OSS

Differential Revision: D20607175

Pulled By: eellison

fbshipit-source-id: c698e7cf3374e501cd5d835663991026a113ec6b
2020-04-08 14:59:26 -07:00
6016f694c0 Revert D20901746: [pytorch][PR] Update docs for master to remove Python 2 references
Test Plan: revert-hammer

Differential Revision:
D20901746

Original commit changeset: 07f8dc8e6fab

fbshipit-source-id: 13c55597f9f79b8473210cf35a5a0f1fb34bae39
2020-04-08 14:49:11 -07:00
83907ded1d Revert D20895316: [pytorch][PR] [JIT] List reland
Test Plan: revert-hammer

Differential Revision:
D20895316

Original commit changeset: 9a2bc0e6bdcb

fbshipit-source-id: d135f0038cf240a0973ecfcd540121cbd4ecb5a7
2020-04-08 14:40:10 -07:00
7c76c71616 [caffe2] remove quant options of SparseAdagrad from OSS (#35608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35608

The various quantization options in SparseAdagradOp are only for experimental purposes and were unnecessarily complicating the code.
Move these options back to internal code, merge them into SparseSimdAdagradStochasticQuantOp, and rename it to SparseSimdAdagradFakeQuantOp.

Test Plan: CI

Differential Revision: D20720426

fbshipit-source-id: 34c8fdea49f239c795f63e978ab13c8f535609d2
2020-04-08 14:33:23 -07:00
3be6a4db4d improve the quantized batch_norm performance (#35639)
Summary:
The original quantized batch_norm is up to 2X slower than C2 for some shapes, especially when the channel count leaves a large remainder after the vectorized blocks of 32. For example, with a total channel size of 32*1 + 24, the 24 remaining channels take the slow path in the original implementation.
Benchmark
```
import torch, time

for dtype in [torch.qint8, torch.quint8, torch.qint32]:
    print('****', str(dtype), '*****')
    x = torch.rand(1, 4, 56, 56, 24)

    q_x = torch.quantize_per_tensor(x, 0.5, 1, dtype)
    q_x = q_x.permute([0, 4, 1, 2, 3])
    c = 24
    mean = torch.rand(c).float()
    var = torch.rand(c).float()
    weight = torch.rand(c).float()
    bias = torch.rand(c).float()
    eps = 0.001

    x = x.permute([0, 4, 1, 2, 3])

    NITER = 10

    s = time.time()
    for i in range(NITER):
        float_out = torch.nn.functional.batch_norm(x, weight=weight, bias=bias, running_mean=mean, running_var=var, training=False, momentum=0, eps=eps)
        float_out = torch.nn.functional.relu(float_out)
    time_per_iter_float = (time.time() - s) / NITER

    s = time.time()
    for i in range(NITER):
        quant_out = torch.ops.quantized.batch_norm3d_relu(q_x, weight, bias, mean, var, eps, 0.5, 1)
    time_per_iter_quant = (time.time() - s) / NITER

    print('time/iter ms (float)', 'time/iter ms (quant)', 'quant/float', sep='\t')
    print(time_per_iter_float * 1000, time_per_iter_quant * 1000, time_per_iter_quant / time_per_iter_float, sep='\t')

```
```
**** torch.qint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
0.6527423858642578      1.649641990661621       2.5272481554532837
**** torch.quint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
0.5787134170532227      1.040959358215332       1.7987475796152104
**** torch.qint32 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
0.5466938018798828      2.262735366821289       4.138944614042739
```

//Before the change:
```
**** torch.qint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
0.7526159286499023      2.330636978149414      3.0967149238128426
**** torch.quint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
0.21767616271972656     1.3946294784545898      6.406900328587075
**** torch.qint32 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
0.24483203887939456     2.561521530151367       10.46236245009251
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35639

Differential Revision: D20723292

Pulled By: lly-zero-one

fbshipit-source-id: 66692eabaffb5030c2a37ec0f1322df3665411aa
2020-04-08 14:28:17 -07:00
82dd01150c Fix race during RPC shutdown. (#36113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36113

As part of debugging https://github.com/pytorch/pytorch/issues/35863,
I discovered that the unit test would timeout during clean shutdown.

Looking into this further, it looks like there is a race in
`_on_leader_follower_report_shutdown_intent` when multiple followers call the
same method on the leader.

To fix this, I've ensured we have an appropriate lock in
`_on_leader_follower_report_shutdown_intent` to guard against this.

I ran the test 500 times to validate that this fix works.

Closes #35863
ghstack-source-id: 101641463

Test Plan:
1) waitforbuildbot
2) Ran the test 500 times.

Differential Revision: D20884373

fbshipit-source-id: 9d580e9892adffc0c9a4c2e832881fb291a1ff16
2020-04-08 14:12:33 -07:00
5910c51545 Exclude torch/csrc/cuda/*nccl* from clang-tidy (#36249)
Summary:
Since the workflow configures PyTorch with `USE_NCCL` set to 0, we cannot tidy those files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36249

Differential Revision: D20926213

Pulled By: malfet

fbshipit-source-id: 69c051b7d22fb5f19147a7955782a7de5137f740
2020-04-08 14:06:00 -07:00
fab06bfb75 Add utility for bundling sample inputs with models (#35631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35631

Bundling sample inputs with our models with a standardized interface
will make it possible to write benchmarking and code-coverage tools that
call all models in a uniform way.  The intent is to make this a standard
for mobile models within Facebook.  Putting it in torch/utils so tests
can run on GitHub and because it might be useful for others as well.

`augment_model_with_bundled_inputs` is the primary entry point.  See
its docstring for usage information and the test for some example uses.

One design question I had was how much power should be available for
automatic deflating and inflating of inputs.  The current scheme gives
some automatic handling and a reasonable escape hatch
("_bundled_input_inflate_format") for top-level tensor arguments, but no
automatic support for (e.g.) tensors in tuples or long strings.  For
more complex cases, we have the ultimate escape hatch of just defining
_generate_bundled_inputs in the model.

Another design question was whether to add the inputs to the model or
wrap the model in a wrapper module that had these methods and delegated
calls to `forward`.  Because models can have other exposed methods and
attributes, the wrapper seemed too onerous.

Test Plan: Unit test.

Differential Revision: D20925013

Pulled By: dreiss

fbshipit-source-id: 4dbbb4cce41e5752133b4ecdb05e1c92bac6b2d5
2020-04-08 13:10:36 -07:00
645d57ea01 Expose JIT Module's "register_attribute" to Python (#35630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35630

Prefix underscored for now because the semantics of this method can be
confusing.  It adds a new attribute to the *type*, which can be shared
by several objects.

Test Plan:
Next diff in stack uses it, and has unit tests.

Imported from OSS

Differential Revision: D20904253

fbshipit-source-id: dcbf60eacf0e0e075c19238165aa33954aa73b5f
2020-04-08 13:09:28 -07:00
246416ac3b clang-tidy workflow only needs cuda-toolkit (#36241)
Summary:
The `cuda` metapackage installs the kernel driver + runtime libraries + toolchain, while the `cuda-toolkit` metapackage, as the name suggests, installs only the toolchain + library headers.

This reduces the dependency-install time for the `clang-tidy` step by 60+ seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36241

Test Plan: CI

Differential Revision: D20923839

Pulled By: malfet

fbshipit-source-id: 1e773285222bed179973449573215fcaee1983de
2020-04-08 12:29:58 -07:00
9a2b505563 [JIT] Shape inference improvement (#35051)
Summary:
Support `aten::div` in `PropagateCompleteShapeOnNode`.

complete shape propagation on `aten::div` is disabled, because shape inference
relies on running the node to propagate shape. For `aten::div` we run into
a divide-by-zero problem.

However, shape propagation for pointwise operations should be identical. We
would be able to swap the operation for `aten::div` with `aten::mul`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35051

Differential Revision: D20921359

Pulled By: eellison

fbshipit-source-id: 344371f34724a1b6bb2f853ebb4cef80423a4f9f
2020-04-08 12:28:40 -07:00
195362d74c [TensorExpr] scalar factorization of Div (#36154)
Summary:
Add support for the TensorExpr IR Simplifier to factorize common terms on either side of a Div node. e.g. `(8 * x) / (4 * y) => (2 * x) / y`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36154

Differential Revision: D20910580

Pulled By: nickgg

fbshipit-source-id: ee071d93bc4711b1e710be312de599d18ab506f3
2020-04-08 11:56:07 -07:00
a91535930f [future] Undo some recent torch::utils::Future api changes (#36220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220

The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106

Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset

Differential Revision: D20918904

fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
2020-04-08 11:28:22 -07:00
93256617c8 C++ Adam optimizer - corrected messages for check of default options (#36161)
Summary:
Modified messages in the check of default options for the Adam optimizer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36161

Differential Revision: D20920140

Pulled By: yf225

fbshipit-source-id: e697ef1741d4dd86f7f18dc0be2c3b4bd3894d8f
2020-04-08 11:22:50 -07:00
ae71c5c7e6 Optimized bincount for the CPU by removing extra size() calls (#35822)
Summary:
By removing the calls to `size` that were effectively no-ops, I've managed to make `bincount_cpu` run around 6 times faster on my machine. EDIT: (Running Windows 10; I suspect this may be a Windows-specific bug.)

For histogramming 1e7 samples with 1e5 bins, best of 20 with 10 runs each
Before: 3.201189
After: 0.466188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35822

Differential Revision: D20919885

Pulled By: ezyang

fbshipit-source-id: 1657056d69a02f1e61434f4cc8fa800f8d4e1fe8
2020-04-08 11:09:14 -07:00
e99c53dc86 Fix broadcast_coalesce for empty tensors (#35965)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35965

Differential Revision: D20919377

Pulled By: ezyang

fbshipit-source-id: cfbcb35a44507de1c3ed7e0732cfc3b124b9bc0b
2020-04-08 11:02:11 -07:00
38849e119f [pytorch] Add error when PyTorch used with Python 2 (#36151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36151

Python 2 has reached end-of-life and is no longer supported by PyTorch. To avoid confusing behavior when trying to use PyTorch with Python 2, detect this case early and fail with a clear message. This commit covers `import torch` only and not C++ for now.

Test Plan: waitforsandcastle

Reviewed By: dreiss

Differential Revision: D20894381

fbshipit-source-id: a1073b7a648e07cf10cda5a99a2cf4eee5a89230
2020-04-08 10:40:27 -07:00
f99e6370dc fix build breakage of //sigrid/... (#36206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36206

nothing wrong with the code, adding appropriate casts
to keep the compiler happy

Test Plan:
build //sigrid/...
tests in the same directory
buck test //glow/glow/tests/fakelowp/...

Reviewed By: jspark1105

Differential Revision: D20911279

fbshipit-source-id: 086ef028006a53048e1cfbe9dbc6c4bdd18fb259
2020-04-08 10:27:27 -07:00
07306406ce s/repo.continuum.io/repo.anaconda.com/ (#36233)
Summary:
Followup after  https://github.com/pytorch/pytorch/pull/36201

Per https://github.com/conda/conda/issues/6886  `repo.anaconda.com` should have been used since Feb 2019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36233

Test Plan: CI

Differential Revision: D20920706

Pulled By: malfet

fbshipit-source-id: 1a9027e60df4de21111731d7fbda28c02846b417
2020-04-08 10:06:50 -07:00
9ada7abc18 [JIT] fix comprehension scope writes (#36105)
Summary:
In a comprehension like:
```
    def f()->int:
        i = 1
        x = [i for i in range(7)]
        return i
```
the variables inside the comprehension do not write to the function environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36105

Differential Revision: D20880699

Pulled By: eellison

fbshipit-source-id: 40af0f7470e0baeff7ef158cb461bf85c816d169
2020-04-08 10:00:45 -07:00
b9fc4358d6 Enabled debug symbol in test_cpp_api_parity tests by default. (#36209)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36209

Differential Revision: D20920042

Pulled By: yf225

fbshipit-source-id: 7e8f5a54bdb90d4a01f59e1f68cf036bf3620293
2020-04-08 09:34:32 -07:00
b9260bdb7b Don't build deps for python setup.py egg_info (#36208)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36207.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36208

Differential Revision: D20919649

Pulled By: ezyang

fbshipit-source-id: b5242a540181b29dba8987fb5f00332e1e81ca98
2020-04-08 09:02:01 -07:00
901bb3c350 Delete as_variable_ref (#36096)
Summary:
This PR closes https://github.com/pytorch/pytorch/issues/34895 and builds on work started by ayushtues in https://github.com/pytorch/pytorch/pull/35184
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36096

Reviewed By: zou3519

Differential Revision: D20893693

Pulled By: astaff

fbshipit-source-id: 13aac1feaef3bcf86f7a4cf92d26e7a1ae43a3b3
2020-04-08 08:57:01 -07:00
4c8e38c6d7 Minor doc improvement for code_analyzer (#36177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36177

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20904241

Pulled By: ezyang

fbshipit-source-id: b13584dfdb1f852e451b1295c0d4cd4a7f53712f
2020-04-08 08:14:50 -07:00
5a03664fd5 Attempt to fix the pytorch_cpp_doc_push build by pinning breathe. (#36190)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36190

Differential Revision: D20907613

Pulled By: gchanan

fbshipit-source-id: 1ae04e12c0920b4fe566604f4d1cab2773117532
2020-04-08 07:42:55 -07:00
83abd7ffbf Revert D20909696: [pytorch][PR] Fix signed-unsigned warnings
Test Plan: revert-hammer

Differential Revision:
D20909696

Original commit changeset: 16723355f473

fbshipit-source-id: e1cf6e9d42f852693549a94d7f5830196781f00e
2020-04-08 01:21:04 -07:00
6f8017bf07 Enable simple executor for FBCODE (#34748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34748

Differential Revision: D20909390

Pulled By: Krovatkin

fbshipit-source-id: b3d0c981825d362d3d4f9012ff8151ffc7a59671
2020-04-08 00:19:49 -07:00
f0bddd5e7a Fix clang-format broken by https://github.com/pytorch/pytorch/pull/33788 (#36203)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36203

Test Plan: CI

Differential Revision: D20911028

Pulled By: malfet

fbshipit-source-id: 93af66ce35139700118efacb5cb6c68175cb66d5
2020-04-07 23:01:07 -07:00
c04232ae2b Back out "[reland] Skip OpenMP Thread when OMP_NUM_THREADS is 1" (#36198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36198

Original commit changeset: 4476a810dfe7

With the previous diff, when the user sets KMP_AFFINITY, it will be ignored when OMP_NUM_THREADS is 1. That could cause a performance regression.

Test Plan: n/a

Reviewed By: ilia-cher

Differential Revision: D20909628

fbshipit-source-id: 5738f99aa61072337146257a68189d3d03ad39f7
2020-04-07 22:54:15 -07:00
25fe27981f Fix signed-unsigned warnings (#36196)
Summary:
Otherwise, bazel spews the following multi-line warning for every single caffe2 operator:
```
In file included from ./c10/util/logging_is_google_glog.h:50,
                 from ./c10/util/Logging.h:26,
                 from ./caffe2/core/logging.h:2,
                 from ./caffe2/core/blob.h:13,
                 from ./caffe2/core/operator.h:18,
                 from ./caffe2/sgd/adadelta_op.h:1,
                 from caffe2/sgd/adadelta_op.cc:1:
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h: In instantiation of 'std::string* google::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::__cxx11::basic_string<char>]':
./caffe2/core/operator.h:192:5:   required from 'const T& caffe2::OperatorBase::Input(int, caffe2::DeviceType) [with T = caffe2::Tensor; caffe2::DeviceType = c10::DeviceType]'
./caffe2/core/operator.h:890:48:   required from 'const caffe2::Tensor& caffe2::Operator<Context>::Input(int, caffe2::DeviceType) [with Context = caffe2::CPUContext; caffe2::DeviceType = c10::DeviceType]'
./caffe2/sgd/adadelta_op.h:87:5:   required from 'bool caffe2::SparseAdadeltaOp<Context>::RunOnDevice() [with Context = caffe2::CPUContext]'
./caffe2/sgd/adadelta_op.h:85:8:   required from here
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:32: warning: comparison of integer expressions of different signedness: 'const int' and 'const long unsigned int' [-Wsign-compare]
  722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
      |                                ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:148:53: note: in definition of macro 'GOOGLE_PREDICT_TRUE'
  148 | #define GOOGLE_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
      |                                                     ^
bazel-out/k8-fastbuild/bin/external/com_github_glog/_virtual_includes/glog/glog/logging.h:722:1: note: in expansion of macro 'DEFINE_CHECK_OP_IMPL'
  722 | DEFINE_CHECK_OP_IMPL(Check_LT, < )
      | ^~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36196

Differential Revision: D20909696

Pulled By: malfet

fbshipit-source-id: 16723355f473379ba9da6d3c33bd561b9724800a
2020-04-07 21:31:01 -07:00
e2f9c668a2 Use repo.anaconda.com instead of repo.continuum.io (#36201)
Summary:
Per https://github.com/conda/conda/issues/6886  `repo.anaconda.com` should have been used since Feb 2019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36201

Test Plan: CI

Differential Revision: D20910667

Pulled By: malfet

fbshipit-source-id: 3a191e2cae293e6f96dbb323853e84c07cd7aabc
2020-04-07 21:09:18 -07:00
986a8fdd6a Use counter instead of vector of futures in _parallel_run (#36159)
Summary:
This should be faster than allocating one mutex, flag, and condition variable per task.

Using `std::atomic<size_t>` to count remaining tasks is not sufficient,
because decrementing the remaining counter and signalling the condition variable must happen atomically;
otherwise `wait()` might get invoked after `notify_one()` was called.
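
A generic sketch of the resulting pattern (struct and member names are assumed, not the exact _parallel_run code): the counter is decremented under the same mutex the waiter uses, so `notify_one()` cannot fire between the waiter's predicate check and its `wait()`.

```cpp
#include <condition_variable>
#include <mutex>
#include <cstddef>

struct TaskCounter {
  explicit TaskCounter(std::size_t n) : remaining(n) {}

  void task_done() {
    std::lock_guard<std::mutex> guard(mtx);  // pairs with the waiter's lock
    if (--remaining == 0) {
      cv.notify_one();
    }
  }

  void wait_all() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [this] { return remaining == 0; });
  }

  std::mutex mtx;
  std::condition_variable cv;
  std::size_t remaining;
};
```
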
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36159

Test Plan: CI

Differential Revision: D20905411

Pulled By: malfet

fbshipit-source-id: facaf599693649c3f43edafc49f369e90d2f60de
2020-04-07 18:59:56 -07:00
fc5d658324 [rpc] allow ability to abort second call to RecvWork::wait() in ProcessGroupAgent::listenLoop (#36084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36084

https://github.com/pytorch/pytorch/pull/30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support for aborting this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Test Plan:
Added test in ProcessGroupAgentTest. We also add a basic config that allows us to control whether to abort the call to `pg->recv()` and `pg->recvAnysource()` in `FailingWaitProcessGroupGloo`.

Run test binary:
```buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going
~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest
```
P128567144

Differential Revision: D20632764

fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582
2020-04-07 18:44:56 -07:00
4b916b6b75 Mark every frame with a unique id (#33788)
Summary:
This PR introduces frame ids that will allow us to associate profiling information with its corresponding run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33788

Differential Revision: D20164897

Pulled By: Krovatkin

fbshipit-source-id: 8172ff9f4d188b339e2ff98a80bbe4a2b306a8aa
2020-04-07 17:52:06 -07:00
72b55fea6b [jit] Make torch::utils::Future and ivalue::future apis closer (#35849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849

This change harmonizes some aspects of the api.

- torch::utils::Future callback should have no args, like ivalue::future.
   Many of the lines of this change are related to fixing that up downstream.

   No args makes the API simpler to use, particularly since many/most of the
   downstream use cases ignore the passed-in args. It's simple enough to
   capture the future in the lambda if necessary (see the sketch after this list).

 - Add error/hasError methods to ivalue::Future.
 - Use c10::optional underneath for the error in ivalue::Future.
 - Change markCompleted(error) to setError(error) in ivalue::Future.
 - Add a setValue(FutureError) version to torch::utils::Future.
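A hedged sketch of the capture pattern referenced above (the `Future` template parameter and the `addCallback`/`hasError` spellings here are assumptions made for illustration, not verbatim signatures):

```
// Fragment only: instead of the callback receiving the future as an
// argument, it captures the future it needs and queries it directly.
auto fut = std::make_shared<torch::utils::Future<bool>>();
fut->addCallback([fut]() {
  if (fut->hasError()) {
    // handle the error that was recorded via setError(...)
  }
});
```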

ghstack-source-id: 101684435

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20803251

fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
2020-04-07 17:05:35 -07:00
373dc7c8ef Group libraries in TOC and add PyTorch Elastic (#34928)
Summary:
Moves XLA out of Notes and groups it with other libraries. Also adds a link to PyTorch Elastic.

![image](https://user-images.githubusercontent.com/8042156/76912125-f76d1080-686f-11ea-99d5-bb7be199adbd.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34928

Differential Revision: D20901732

Pulled By: jlin27

fbshipit-source-id: a5da915bb435a3aa8995d8bbe87f53ef79fd3ce6
2020-04-07 16:37:45 -07:00
2afe171538 [JIT] List reland (#36146)
Summary:
Relanding https://github.com/pytorch/pytorch/pull/33783
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36146

Differential Revision: D20895316

Pulled By: eellison

fbshipit-source-id: 9a2bc0e6bdcbd43f9abe51eadaa28f90bccafcc9
2020-04-07 16:18:30 -07:00
43234be525 Update docs for master to remove Python 2 references (#36114)
Summary:
Full details in task: https://our.intern.facebook.com/intern/tasks/?t=64776265

With PyTorch 1.5+ we remove Python 2 support from PyTorch. All documentation under docs/ and on the pytorch.org website needs to remove Python 2 references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36114

Differential Revision: D20901746

Pulled By: jlin27

fbshipit-source-id: 07f8dc8e6fab0b232e5048a63079cab0c433c85f
2020-04-07 16:13:18 -07:00
ebf743a63a Fix bazel-test linking issue (#36157)
Summary:
Move `src/GenerateI8Depthwise.cc` from `fbgemm_baze` to `fbgemm_avx2`, because Bazel hides unused functions across the libraries
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36157

Differential Revision: D20900955

Pulled By: malfet

fbshipit-source-id: a180c8b54ca39bc076f42f2a740fd3b7f20750dc
2020-04-07 13:57:40 -07:00
8afa001d89 Revert D20885968: [pytorch][PR] Enable backtrace with MSVC
Test Plan: revert-hammer

Differential Revision:
D20885968

Original commit changeset: 6ad3822af31e

fbshipit-source-id: 468199cae2178b17b7ff63114e274b6844eecb7f
2020-04-07 12:10:45 -07:00
c2901333f1 Updating submodules
Summary:
GitHub commits:

705c16caef
2f18250af6
4e89db8a8e
c97495c660

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: fee6c68f6ef541d230716bc9be8e978e20958e2a
2020-04-07 11:28:55 -07:00
681ca45717 [ONNX] Export torch.inverse op (#35318)
Summary:
Added support for torch.inverse export as part of opset 12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35318

Reviewed By: hl475

Differential Revision: D20865900

Pulled By: houseroad

fbshipit-source-id: be12b5194d21408cae24eec16e9e12377e8546ad
2020-04-07 10:48:33 -07:00
6bc8ffe824 [JIT] Optimize before inlining (#35562)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/35424, only this time I run optimizations in the right order so the PR description is actually true.

This speeds up the inlining pass of FairSeq model from 180s -> 13s, and MaskRCNN model from 5s -> 1.5s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35562

Differential Revision: D20738922

Pulled By: eellison

fbshipit-source-id: 1439cf9d1f0bc780e2d64a744694f8b3b7ba4b70
2020-04-07 09:42:26 -07:00
2b06d5adc6 Fix compilation errors for enabling Intel nextgen compiler (icx/icpx) (#35939)
Summary:
ICPX's aggressive inlining can elide implicit template instantiations, causing linking errors.
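A generic illustration of the failure mode (my sketch, not code from the PR): if the compiler inlines every use of a template and emits no out-of-line body, a translation unit that still references the symbol fails to link, and an explicit instantiation forces the definition to be emitted.

```
// Generic sketch; square<float> is an invented example function.
template <typename T>
T square(T x) { return x * x; }

// If aggressive inlining leaves no out-of-line square<float> for other
// translation units to link against, this explicit instantiation
// guarantees the symbol exists:
template float square<float>(float);
```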

Signed-off-by: caozhong <zhong.z.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35939

Reviewed By: jianyuh

Differential Revision: D20887025

Pulled By: jspark1105

fbshipit-source-id: 0618634c63dd3145ef11196ca140e974bdddd940
2020-04-07 09:31:32 -07:00
444073efde Add GenerateI8Depthwise.cc to bazel build definition of fbgemm (#36144)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36144

Differential Revision: D20894322

Pulled By: malfet

fbshipit-source-id: 306702d4c6e2c79fef2345b0befd101d0a9317bf
2020-04-07 09:24:38 -07:00
16d9bcd725 Fix test_avg_pool3d issue in pytorch_paralleltbb_linux_xenial_py3_6_gcc5_4_test (#36103)
Summary:
Fix parallel execution issue introduced by https://github.com/pytorch/pytorch/issues/35740
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36103

Test Plan: test_quantized.py

Differential Revision: D20879323

Pulled By: allwu

fbshipit-source-id: a2deaaf5c933cbef3096a399c19c44d28935bd69
2020-04-07 08:26:20 -07:00
7920a970c6 Don't statically link MKL multiple times on Linux (#36078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36078

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20873442

Pulled By: ezyang

fbshipit-source-id: c576432b1016beb735dca0b9a8bebb752f764ca8
2020-04-07 08:14:16 -07:00
34b32ca914 Remove operator-> from at::Generator (#36027)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36027

Differential Revision: D20856462

Pulled By: pbelevich

fbshipit-source-id: 156fc23d51d8125d41e96b36b3b1312f13040588
2020-04-07 08:07:07 -07:00
3328a2f903 Rename CPUGenerator to CPUGeneratorImpl and CUDAGenerator to CUDAGeneratorImpl (#36026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36026

Differential Revision: D20856458

Pulled By: pbelevich

fbshipit-source-id: 6d105593dca67640d508a4aebf7edf028d52af32
2020-04-07 08:05:23 -07:00
7e84a30ad6 Enable backtrace with MSVC (#36039)
Summary:
Make it possible to report the C++ exceptions in console.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36039

Differential Revision: D20885968

Pulled By: ezyang

fbshipit-source-id: 6ad3822af31e5a64c4a93f16627fbefb7750e1c8
2020-04-07 07:25:12 -07:00
b55dee9fe1 fix max_pool2d cuda version Dimension out of range issue(#36046) (#36095)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36095

Test Plan: Imported from OSS

Differential Revision: D20876733

Pulled By: glaringlee

fbshipit-source-id: a2b92fd2dd0254c5443af469e3fb2faa2323e5c9
2020-04-07 01:12:00 -07:00
3e5d25fdfd Skips test_avg_pool3d_nhwc (#36130)
Summary:
See https://github.com/pytorch/pytorch/issues/36129.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36130

Differential Revision: D20888016

Pulled By: mruberry

fbshipit-source-id: 3738ac6c7f4370b03fd528c90414ba8a7944b3bb
2020-04-06 23:50:32 -07:00
70acc9c0f5 Skips test_qadd_scalar_relu (#36128)
Summary:
See https://github.com/pytorch/pytorch/issues/36127.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36128

Differential Revision: D20887918

Pulled By: mruberry

fbshipit-source-id: d3745f173ad713bb2847157df2890a1b7c18f0af
2020-04-06 23:18:57 -07:00
2e8f9547fa Updating submodules
Summary:
GitHub commits:

1f42be50b7

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: ce850ccf3a4857edc09bd5f5086f9650f95830d8
2020-04-06 22:55:28 -07:00
447bcd341d Bazel build of pytorch with gating CI (#36011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36011

Differential Revision: D20873430

Pulled By: malfet

fbshipit-source-id: 8ffffd10ca0ff8bdab578a70a9b2b777aed985d0
2020-04-06 22:50:33 -07:00
64594d8333 Clang 9 and GCC 9 Support (#35835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35835

Make compilable with Clang 9 and GCC 9.

Test Plan: Compile with Clang 9 and GCC 9

Differential Revision: D20800182

fbshipit-source-id: dd9474640270de0ad6392641513a7f2fa970d6e3
2020-04-06 21:14:43 -07:00
803a4e135e Fixes CMake lint error (#36123)
Summary:
```
Total Errors: 1
Ignoring file: aten/src/ATen/ATenConfig.cmake.in
caffe2/CMakeLists.txt:504: Extra spaces between 'if' and its () [whitespace/extra]
Ignoring file: cmake/Caffe2Config.cmake.in
Ignoring file: cmake/Caffe2ConfigVersion.cmake.in
```

Fixes that error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36123

Differential Revision: D20886027

Pulled By: mruberry

fbshipit-source-id: 826a8b02bb128916e3b1634f3ff312cc36e100b5
2020-04-06 21:00:37 -07:00
449a4ca340 Add more alternative filters in places people forgot to add them. (#36082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36082

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20874618

Pulled By: ezyang

fbshipit-source-id: b6f12100a247564428eb7272f803a03c9cad3a97
2020-04-06 20:29:24 -07:00
3570ef6a0f Revert D20876204: [pytorch][PR] Add trivial reduce for Cuda
Test Plan: revert-hammer

Differential Revision:
D20876204

Original commit changeset: a719f3583cc4

fbshipit-source-id: 6d00afb3a24754d283a7b832c0b784ed9fce36e1
2020-04-06 20:17:04 -07:00
459163b8eb Revert D20449887: [dt][caffe2] enable using smart exceptions in async nets
Test Plan: revert-hammer

Differential Revision:
D20449887

Original commit changeset: 047fdf1bd52f

fbshipit-source-id: 3d7801613f86885c204f3946f3a52a855516faa3
2020-04-06 19:37:05 -07:00
0f243688be Updating submodules
Summary:
GitHub commits:

545a6d3fe4
2f75edd34f
f53cdab3d7
1c00a2daaf
284e1c738b

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: bf0582a3740d7e07ae2f009002acc0a8ea275917
2020-04-06 19:25:42 -07:00
2f1ca26abd Update NNPI Backend to v0.5.1.4 (#4334)
Summary:
Update of NNPI Backend to v0.5.1.4 branched out of commit 2aa0e5d3e108a0607acd3184dc803cfd77bd6c3c.
Pull Request resolved: https://github.com/pytorch/glow/pull/4334

Reviewed By: jfix71

Differential Revision: D20631651

Pulled By: arunm-git

fbshipit-source-id: e5d770c22ccccb753f13035d82c1e61951c256a5
2020-04-06 19:15:49 -07:00
4c140052a6 bfloat16: vectorized unary ops (#35092)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35092

Test Plan: Imported from OSS

Differential Revision: D20721147

Pulled By: ngimel

fbshipit-source-id: 5d40eed36fd5c8b2d0d08bfb1b416fb608a5eaef
2020-04-06 18:52:39 -07:00
3da67ce367 [caffe2] Factor libtorch_python_sources into exposed definition (#36005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36005

Getting ovrsource pytorch working requires a single source list across ovrsource and fbsource, to avoid build failures every time the source list changes.
This diff factors out libtorch_python_sources into a separate function (it needs to be a function because it uses glob, which is disallowed at global scope).

Test Plan: CI

Reviewed By: malfet

Differential Revision: D20852072

fbshipit-source-id: 0e8ae3f6605e090e3ffdd6aa227fac905e7d9877
2020-04-06 18:28:08 -07:00
6a45584272 Remove __nv_relfatbin section from nccl_static library (#35843)
Summary:
NCCL library is built using [CUDA separate compilation](https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/), which consists of building intermediate CUDA binaries and then linking them into GPU code that can be executed on device. Intermediate CUDA code is stored in the `__nv_relfatbin` section, and code that can be launched is stored in `.nv_fatbin`. When `nvcc` is used to link an executable/shared library, it removes those intermediate binaries, but the default host linker is not aware of that, so the intermediate code is kept inside the host executable. Help the host linker by removing the `__nv_relfatbin` sections from the object files inside `libnccl_static.a`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35843

Test Plan: Build pytorch with CUDA and run `test_distributed.py`

Differential Revision: D20882224

Pulled By: malfet

fbshipit-source-id: f23dd4aa416518324cb38b9bd6846e73a1c7dd21
2020-04-06 18:23:08 -07:00
a81be33a4e Add trivial reduce for Cuda (#36092)
Summary:
Detect non-read-only loads, and do not use `__ldg` for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36092

Reviewed By: ZolotukhinM

Differential Revision: D20876204

Pulled By: zheng-xq

fbshipit-source-id: a719f3583cc4ca30fcfb49d999ca785181354d84
2020-04-06 17:58:50 -07:00
f421cf3978 update comments on fake operator (#36086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36086

update comments

Test Plan:
internal tests pass

buck test //glow/fb/test/numerics/...

Reviewed By: yinghai

Differential Revision: D20875297

fbshipit-source-id: f0dde406c66ab6c9e5cb1c4f669f162486fefda0
2020-04-06 16:54:23 -07:00
5d33cf5dfc [Shape Inference] Set new shape according to precedence of dimType over previous value (#36081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36081

Reviewed By: yinghai, ipiszy

Differential Revision: D20873112

fbshipit-source-id: a610f989b9edb830097fda7502c04400ddfb42f1
2020-04-06 16:25:08 -07:00
4ced22c5de [JIT] Add IR Benchmarking tests to ai bench (#35732)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35732

Adding IR Complexity benchmarking tests to AI Bench. For a full description of IR benchmarking, see https://github.com/pytorch/pytorch/pull/34918.

Test Plan:
https://our.intern.facebook.com/intern/testinfra/testconsole/testrun/5348024580589132/ test run

Local run:

PyTorchObserver {"type": "NET", "metric": "conv1d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv1d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv2d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv2d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv3d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv3d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_transpose1d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_transpose1d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_transpose2d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_transpose2d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_transpose3d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_transpose3d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_tbc_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "conv_tbc_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "avg_pool1d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "avg_pool1d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "avg_pool2d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "avg_pool2d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "avg_pool3d_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "avg_pool3d_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
I0403 15:39:29.672143 2015194 init.cc:579] Skip logging in unit test environment for event: torch.script.compile
PyTorchObserver {"type": "NET", "metric": "Linear_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "Linear_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "Threshold_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "Threshold_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "ReLU_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "ReLU_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "ReLU6_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "ReLU6_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "RReLU_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "RReLU_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "Hardtanh_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "Hardtanh_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "Sigmoid_profiled_num_ifs_loops", "unit": "scalar", "value": "0.0"}
PyTorchObserver {"type": "NET", "metric": "Sigmoid_profiled_num_non_tensor_nodes", "unit": "scalar", "value": "0.0"}

Reviewed By: driazati

Differential Revision: D20754238

fbshipit-source-id: 179240dee516647e5583b9fe47083c84241ddacc
2020-04-06 16:14:25 -07:00
40a45957a0 May fix TopKTypeConfig<at::Half> without an additional Bitfield specialization (#36077)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36077

Test Plan: Imported from OSS

Differential Revision: D20872623

Pulled By: gchanan

fbshipit-source-id: 9363dfc22cc316fa9e845f8b479da7894976079f
2020-04-06 16:14:21 -07:00
2173746f64 Compile THCTensorTopK per dtype. (#36074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36074

ROCm builds fail inconsistently on this file by timing out.

Test Plan: Imported from OSS

Differential Revision: D20872395

Pulled By: gchanan

fbshipit-source-id: 20d0890433b7290c36ed99bc7bfb73a93971ead1
2020-04-06 16:12:27 -07:00
7d1f06462c Fixing Potential TSAN issue with joining RPC helper threads (#36094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36094

The condition variable waiting in the RPC retry thread must be notified after setting the atomic `running` to False. This ensures the thread is joinable and allows `rpc.shutdown` to function correctly.
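A generic sketch of that ordering (class and member names are invented): flip the flag under the same mutex the retry loop waits on, then notify, then join.

```
#include <condition_variable>
#include <mutex>
#include <thread>

class RetryLoop {
 public:
  void start() { thread_ = std::thread([this] { run(); }); }

  void shutdown() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      running_ = false;   // set the flag first, under the lock ...
    }
    cv_.notify_all();     // ... then wake the retry thread
    thread_.join();       // the loop observes running_ == false and exits
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (running_) {
      cv_.wait(lock);     // cannot miss the notification sequenced above
      // ... process retries ...
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  bool running_ = true;
  std::thread thread_;
};
```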
ghstack-source-id: 101538860

Test Plan: build bot

Differential Revision: D20854763

fbshipit-source-id: b92050712a1e6c31d4dd3b3d98f32ef8dee0f2f2
2020-04-06 15:56:06 -07:00
b8383b3d4c [WIP] Enable NNC's LLVM dependency in CI (#35564)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35564

Differential Revision: D20848144

Pulled By: resistor

fbshipit-source-id: 992589447162766fbe8df0c696563511a2bb8e52
2020-04-06 15:54:35 -07:00
2ef1ace877 [rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix (#35394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35394

As above
ghstack-source-id: 101592571

Test Plan: Existing CI, no longer flaky

Differential Revision: D20632405

fbshipit-source-id: fbfd81470b3361371109af341f0db3ef8b3a415b
2020-04-06 15:18:30 -07:00
cc78914755 qactivation_benchmarks: small bug fix (#35731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731

Changes relu and relu6 to point to the functional implementations here.
The previous behavior tested the time to create the module but didn't actually run the
function (I noticed this when adding the new input sizes and seeing
that the measured time did not change).

Test Plan:
run the benchmark, the time now changes as expected with input size for
these.

Imported from OSS

Differential Revision: D20875542

fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
2020-04-06 15:02:33 -07:00
6405f26a02 add more quantized activation benchmarks and input sizes (#35729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729

* There were a few quantized activations which had implementations but no benchmarks; this adds them.
* Adds the input sizes from `unary_tests.py` here, so we can compare fairly between the floating-point and quantized implementations of the activations.

Test Plan:
```
python -m pt.qactivation_test
```

Imported from OSS

Differential Revision: D20875544

fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d
2020-04-06 15:02:29 -07:00
b68c3827de add benchmark for quantized batchnorm (#35389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35389

Adds a benchmark for quantized batchnorm, with the parameters
the same compared to floating point batchnorm benchmark.

Test Plan:
run benchmarks
https://gist.github.com/vkuzo/c49be58abdf0ff64797fab3936d0cb15

Imported from OSS

Differential Revision: D20875543

fbshipit-source-id: ced89fbe2d18168e92950d0b74ca638aba54cd96
2020-04-06 15:01:05 -07:00
8ef82fc2c9 [dt][caffe2] enable using smart exceptions in async nets (#34753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34753

This improves support for exceptions and capturing stack traces in caffe2 async nets. We generally want to use exceptions everywhere we can in order to preserve stack information. It also makes the exception timestamp more accurate so multiple exceptions at the same time can be correctly ordered.

Test Plan: Updated the tests to use the new error semantics + adds a test to ensure the stack is correctly propagated through deferrable async scheduling.

Reviewed By: andrewwdye

Differential Revision: D20449887

fbshipit-source-id: 047fdf1bd52fd7c7c1f3fde77df9a27ed9e288e7
2020-04-06 14:27:07 -07:00
3228939f23 [JIT] Fix fake_range() (#36083)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36083

Test Plan: Imported from OSS

Differential Revision: D20874745

Pulled By: jamesr66a

fbshipit-source-id: fc57defefbc8e9840b8d5bac89b4146179e00b06
2020-04-06 14:12:35 -07:00
3e402a5940 [ROCm] Enable BFloat16 type for add_out_sparse (#35978)
Summary:
Enables bfloat16 type for add_out of sparse tensors. Also enables it for coalesce(), which is used in unit-test reference checking.

iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35978

Differential Revision: D20874142

Pulled By: ezyang

fbshipit-source-id: af8d2f4bc5f5cc3bb7f8cb1e3c688669ba3d13b9
2020-04-06 14:07:17 -07:00
cb385cb6d7 Pin Sphinx to 2.4.4 (take 2), fix docs CIs (#36072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36072

Update to https://github.com/pytorch/pytorch/pull/36065/, which was almost there.

Test Plan: - Wait for CI

Differential Revision: D20871661

Pulled By: zou3519

fbshipit-source-id: 2bf5ce382e879aafd232700ff1c0d61fc17ea52d
2020-04-06 13:48:59 -07:00
0475d7b08d [JIT] optimize mutableType calls (#35474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35474

I had previously tried to optimize getMutableTypePtr calls by not recursing through container types, but it turns out there are a few uses of container types which refine their contained elements.
This attempt was in #35301

Now I am optimizing calls by caching TypePtr -> mutable TypePtr conversions. Now that we are doing caching, none of the functions marked as const are really const anymore; previously, many of the const functions already mutated internal state, such as rebuildWriteCache.

One kind of annoying thing is that there is a general API for querying mutability, isMutableType, that doesn't use the cache, and an internal one that does, isMutableTypeInternal. It would be nice if calling isMutableType within alias analysis dispatched to the internal function, but I'm not sure how to do that.

getMutableTypePtr showed up as 12% of the first run of FairSeq, so this is a function worth optimizing.
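A self-contained sketch of the caching idea (all names here are invented; the real code maps a TypePtr to its mutable equivalent). The `mutable` member is exactly why the const-qualified functions are no longer truly const:

```
#include <memory>
#include <unordered_map>

struct Type {};
using TypePtr = std::shared_ptr<Type>;

class MutableTypeCache {
 public:
  TypePtr get(const TypePtr& t) const {
    auto it = cache_.find(t.get());
    if (it != cache_.end()) {
      return it->second;          // hit: skip the recursive conversion
    }
    TypePtr converted = compute(t);
    cache_.emplace(t.get(), converted);
    return converted;
  }

 private:
  // Stand-in for the real recursive TypePtr -> mutable TypePtr conversion.
  TypePtr compute(const TypePtr& t) const { return t; }

  // 'mutable' lets the const-qualified get() above fill the cache.
  mutable std::unordered_map<const Type*, TypePtr> cache_;
};
```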

Test Plan: Imported from OSS

Differential Revision: D20873493

Pulled By: eellison

fbshipit-source-id: 1b42bb58ba4142c118a6bc47a26978cd7fd0ac79
2020-04-06 13:31:51 -07:00
45fc881f05 [ROCm] Hotfix: Black list tensorexpr test set that has failures on ROCm (#36049)
Summary:
The test set was enabled with ROCm failures in https://github.com/pytorch/pytorch/pull/35914 - blacklist it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36049

Differential Revision: D20869814

Pulled By: zou3519

fbshipit-source-id: fcdb2abc9f3407344b56cf8d48b7740008317020
2020-04-06 13:26:05 -07:00
59ed0c5fd7 Strip newline when ingesting version.txt (#36002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36002

Test Plan: Run cmake and observe there are no warning in stdout nor in `CMakeCache.txt`

Differential Revision: D20872854

Pulled By: malfet

fbshipit-source-id: 8a61b63b3d564e597e7a62dd913c97bc64b183b9
2020-04-06 13:21:10 -07:00
4ef383d5db add type hints on recently added ops to make them scriptable (#35885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35885

For the ops I added recently, ensure all the type hints are
present, so that the JIT can script them.

We might want to look into a test for this in the future.

Test Plan:
scripting works for all of them now:
https://gist.github.com/vkuzo/1d92fdea548ad596310fffcbe95e4438

Imported from OSS

Differential Revision: D20818431

fbshipit-source-id: 0de61eaf70c08d625128c6fffd05788e6e5bb920
2020-04-06 12:17:16 -07:00
8dba98da0f [ONNX] Added support for constant folding onnx::Add and onnx::Sub (#35869)
Summary:
Added support for constant folding onnx::Add and onnx::Sub
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35869

Reviewed By: hl475

Differential Revision: D20865640

Pulled By: houseroad

fbshipit-source-id: 2b8c1cc196959b5b5b9ce018dbdcb74d59a92d9f
2020-04-06 10:50:21 -07:00
d568c7d966 [TensorExpr] add more detail to malformed_input exceptions (#35891)
Summary:
Add an explanation string to malformed_input exceptions thrown inside jit/tensorexpr to aid in debugging issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35891

Differential Revision: D20822306

Pulled By: nickgg

fbshipit-source-id: ce153a05218f2a4da5ecf5f1a5dc439070c96e55
2020-04-06 10:36:31 -07:00
82d58ed484 disable the test to stop breaking the builds (#36053)
Summary:
allwu: leaving it to you for further investigation and to enable it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36053

Differential Revision: D20865286

Pulled By: lly-zero-one

fbshipit-source-id: b3e44b1343b66944aaa5a0a3909c8b5e9390c52f
2020-04-05 21:05:49 -07:00
e56ba8481e [ONNX] fix size for opset 11 (#35984)
Summary:
Fixing size, as the aten op has been updated to support 0 inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35984

Reviewed By: hl475

Differential Revision: D20858214

Pulled By: houseroad

fbshipit-source-id: 8ad0a0174a569455e89da6798eed403c8b162a47
2020-04-05 18:58:54 -07:00
8224398c14 [pytorch] Fix the extra_repr print message for float16 dynamic quantization (#36044)
Summary:
When applying the float16 dynamic quantization with
```
        model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.float16
        )
        print(model)
```
there is an issue when we try to print the model. Basically, we cannot print the `qscheme` information for a float16 weight (it is neither the per-tensor nor the per-channel quantization defined for int8 dynamic quantization).

Before this PR:
```
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 860, in <module>
    print(dlrm)
  File "/home/jianyuhuang/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1142, in __repr__
    mod_str = repr(module)
  File "/home/jianyuhuang/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1142, in __repr__
    mod_str = repr(module)
  File "/home/jianyuhuang/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1136, in __repr__
    extra_repr = self.extra_repr()
  File "/home/jianyuhuang/miniconda3/lib/python3.7/site-packages/torch/nn/quantized/dynamic/modules/linear.py", line 55, in extra_repr
    self.in_features, self.out_features, self.weight().qscheme()
RuntimeError: Could not run 'aten::qscheme' with arguments from the 'CPUTensorId' backend. 'aten::qscheme' is only available for these back
ends: [QuantizedCPUTensorId, VariableTensorId].
```

After this PR:
```
    (4): DynamicQuantizedLinear(
      in_features=2, out_features=1, dtype=torch.float16
      (_packed_params): LinearPackedParams()
    )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36044

Differential Revision: D20860811

Pulled By: jianyuh

fbshipit-source-id: d1405a185f46a8110e6d27982b40534c854f4d1c
2020-04-05 14:27:42 -07:00
81c8ca1e2e Disable tracing for Pytorch Mobile client (#36007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36007

Tracing is not needed in the PyTorch Mobile client. Disabling it has a couple of benefits:
1. It's a pre-requisite to building the lite interpreter.
2. It saves code size for full JIT and federated learning (around 600k).

Solution: use PYTORCH_DISABLE_TRACING to disable it. It's better than passing an argument to code-gen because:
1. It's a single-point change in the code template for both VariableType and VariableFactories.
2. code-gen does not handle VariableTypeManual.cpp; the macro is needed there anyway.
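A hedged sketch of such a guard (the function below is invented for illustration; the real change applies the `#ifndef` inside the generated VariableType / VariableFactories templates):

```
#ifndef PYTORCH_DISABLE_TRACING
#include <cstdio>
#endif

// Invented stand-in for a generated wrapper.
int traced_add(int self, int other) {
#ifndef PYTORCH_DISABLE_TRACING
  std::printf("trace: add\n");  // stands in for recording the op in the trace
#endif
  return self + other;          // the underlying kernel call
}
```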
ghstack-source-id: 101529401

Test Plan: CI

Reviewed By: ljk53

Differential Revision: D20852558

fbshipit-source-id: c28cec9f90208974acfa351ec9aec3fabbbb8aac
2020-04-05 13:55:38 -07:00
66d50060eb Temporary methods for real and imag values of complex tensors (#35879)
Summary:
Notes:
1. didn't name them as _copy_real and _copy_imag because it's desirable (but not necessary) to have these methods as tensor methods.
2. replaced old .real() and .imag() instances with _copy_real() and _copy_imag() methods
3. didn't add documentation because we plan to remove these methods when we add real and imag as tensor attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35879

Differential Revision: D20841760

Pulled By: anjali411

fbshipit-source-id: 7267e6fbaab9a5ce426e9396f12238994666b0dd
2020-04-05 07:22:02 -07:00
b3cdec88e3 Fix torch complex exp CPU implementation (#35532) (#35715)
Summary:
There was a permutation operation missing in each of the complex vector files. I also added some test cases, the last two of which fail under the current implementation. This PR fixes that: all the testcases pass.

Fixes https://github.com/pytorch/pytorch/issues/35532

dylanbespalko
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35715

Differential Revision: D20857024

Pulled By: anjali411

fbshipit-source-id: 4eecd8f0863faa838300951626f26b89e6cc9c6b
2020-04-04 15:33:32 -07:00
7ee88d61f7 Rename boxing/unboxing files and utilities (#35411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35411

The file and class names in ATen/core/boxing were quite confusing.
Let's rename them for readability.

Also move function schema inference out of the boxing logic into op_registration.h where it belongs.
ghstack-source-id: 101539206

Test Plan: waitforsandcastle

Differential Revision: D20653621

fbshipit-source-id: 6a79c73d5758bee1e072d543c030913b18a69c7c
2020-04-04 14:13:28 -07:00
8a6173edf2 [caffe2] tune prefetch distance
Summary: As title

Test Plan:
Prefetch distance 8 is slightly faster overall, probably because prefetch distance 16 was tuned for int8 and is a bit slower for int4.
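For context, a simplified sketch of the software-prefetch pattern whose distance is being tuned (the plain-float kernel and all names are invented; the real kernels gather 4-bit quantized rows):

```
#include <cstdint>
#include <xmmintrin.h>

constexpr int kPrefetchDistance = 8;  // the knob this diff tunes

float sum_rows(const float* table, const int64_t* indices, int n, int dim) {
  float acc = 0.f;
  for (int i = 0; i < n; ++i) {
    if (i + kPrefetchDistance < n) {
      // Start pulling a future row into cache while summing the current one.
      const float* future_row = table + indices[i + kPrefetchDistance] * dim;
      _mm_prefetch(reinterpret_cast<const char*>(future_row), _MM_HINT_T0);
    }
    const float* row = table + indices[i] * dim;
    for (int d = 0; d < dim; ++d) {
      acc += row[d];
    }
  }
  return acc;
}
```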

Comparing 4, 8, and 16

SKL T6
```
 bit_rate     4batch size     1  num rows         5000000   emb dim   112      avg length   100                         |  bit_rate     4batch size     1  num rows         5000000   emb dim   112      avg length   100                         |  bit_rate     4batch size     1  num rows         5000000   emb dim   112      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w   3.23747 GB/s     effective b/w:           3.4533GB/s   |              SLS    cache not flushed       prefetch 8     b/w   3.14268 GB/s     effective b/w:          3.35219GB/s   |              SLS    cache not flushed       prefetch 16     b/w   3.31473 GB/s     effective b/w:          3.53572GB/s
              SLS        cache flushed       prefetch 4     b/w  0.953971 GB/s     effective b/w:          1.01757GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.908884 GB/s     effective b/w:         0.969477GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.858664 GB/s     effective b/w:         0.915908GB/s
  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length   100                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length   100                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w  0.738689 GB/s     effective b/w:          2.95476GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.593186 GB/s     effective b/w:          2.37274GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.654879 GB/s     effective b/w:          2.61952GB/s
              SLS        cache flushed       prefetch 4     b/w  0.422531 GB/s     effective b/w:          1.69013GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.525107 GB/s     effective b/w:          2.10043GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.509311 GB/s     effective b/w:          2.03724GB/s
  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length    50                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length    50                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length    50
  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50
              SLS    cache not flushed       prefetch 4     b/w   0.28347 GB/s     effective b/w:          1.29586GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.273341 GB/s     effective b/w:          1.24956GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.550877 GB/s     effective b/w:           2.5183GB/s
              SLS        cache flushed       prefetch 4     b/w  0.348034 GB/s     effective b/w:          1.59101GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.359454 GB/s     effective b/w:          1.64322GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.348086 GB/s     effective b/w:          1.59125GB/s
  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length    50                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length    50                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length    50
  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50
              SLS    cache not flushed       prefetch 4     b/w  0.322958 GB/s     effective b/w:          1.29183GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.622423 GB/s     effective b/w:          2.48969GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.331167 GB/s     effective b/w:          1.32467GB/s
              SLS        cache flushed       prefetch 4     b/w  0.406938 GB/s     effective b/w:          1.62775GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.548998 GB/s     effective b/w:          2.19599GB/s   |              SLS        cache flushed       prefetch 16     b/w   0.40833 GB/s     effective b/w:          1.63332GB/s
  bit_rate     4batch size     1  num rows         4000000   emb dim    68      avg length   100                         |  bit_rate     4batch size     1  num rows         4000000   emb dim    68      avg length   100                         |  bit_rate     4batch size     1  num rows         4000000   emb dim    68      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w   2.26016 GB/s     effective b/w:          3.80658GB/s   |              SLS    cache not flushed       prefetch 8     b/w   1.66055 GB/s     effective b/w:          2.79671GB/s   |              SLS    cache not flushed       prefetch 16     b/w   2.31538 GB/s     effective b/w:          3.89959GB/s
              SLS        cache flushed       prefetch 4     b/w  0.714837 GB/s     effective b/w:          1.20394GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.659482 GB/s     effective b/w:          1.11071GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.643239 GB/s     effective b/w:          1.08335GB/s
  bit_rate     4batch size    16  num rows        10000000   emb dim    28      avg length   100                         |  bit_rate     4batch size    16  num rows        10000000   emb dim    28      avg length   100                         |  bit_rate     4batch size    16  num rows        10000000   emb dim    28      avg length   100
  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600
              SLS    cache not flushed       prefetch 4     b/w   2.33704 GB/s     effective b/w:          8.30946GB/s   |              SLS    cache not flushed       prefetch 8     b/w   2.53271 GB/s     effective b/w:          9.00521GB/s   |              SLS    cache not flushed       prefetch 16     b/w   2.27881 GB/s     effective b/w:          8.10242GB/s
              SLS        cache flushed       prefetch 4     b/w  0.594799 GB/s     effective b/w:          2.11484GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.675113 GB/s     effective b/w:           2.4004GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.681539 GB/s     effective b/w:          2.42325GB/s
  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length   100                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length   100                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w  0.564903 GB/s     effective b/w:          2.58242GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.416964 GB/s     effective b/w:          1.90612GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.544874 GB/s     effective b/w:          2.49085GB/s
              SLS        cache flushed       prefetch 4     b/w  0.342759 GB/s     effective b/w:           1.5669GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.394844 GB/s     effective b/w:            1.805GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.339888 GB/s     effective b/w:          1.55378GB/s
  bit_rate     4batch size    40  num rows         5000000   emb dim    12      avg length    40                         |  bit_rate     4batch size    40  num rows         5000000   emb dim    12      avg length    40                         |  bit_rate     4batch size    40  num rows         5000000   emb dim    12      avg length    40
  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600
              SLS    cache not flushed       prefetch 4     b/w   1.72766 GB/s     effective b/w:           11.057GB/s   |              SLS    cache not flushed       prefetch 8     b/w   1.77403 GB/s     effective b/w:          11.3538GB/s   |              SLS    cache not flushed       prefetch 16     b/w   1.72637 GB/s     effective b/w:          11.0488GB/s
              SLS        cache flushed       prefetch 4     b/w  0.388679 GB/s     effective b/w:          2.48754GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.440318 GB/s     effective b/w:          2.81803GB/s   |              SLS        cache flushed       prefetch 16     b/w   0.46335 GB/s     effective b/w:          2.96544GB/s
```

BDW T6
```
  bit_rate     4batch size     1  num rows         5000000   emb dim   112      avg length   100                         |  bit_rate     4batch size     1  num rows         5000000   emb dim   112      avg length   100                         |  bit_rate     4batch size     1  num rows         5000000   emb dim   112      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w  0.160859 GB/s     effective b/w:         0.171583GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.153157 GB/s     effective b/w:         0.163367GB/s   |              SLS    cache not flushed       prefetch 16     b/w   0.13472 GB/s     effective b/w:         0.143701GB/s
              SLS        cache flushed       prefetch 4     b/w  0.147863 GB/s     effective b/w:         0.157721GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.127365 GB/s     effective b/w:         0.135856GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.147118 GB/s     effective b/w:         0.156926GB/s
  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length   100                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length   100                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w  0.190173 GB/s     effective b/w:         0.760691GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.275278 GB/s     effective b/w:          1.10111GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.190026 GB/s     effective b/w:         0.760104GB/s
              SLS        cache flushed       prefetch 4     b/w  0.160147 GB/s     effective b/w:         0.640589GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.188328 GB/s     effective b/w:         0.753313GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.168198 GB/s     effective b/w:         0.672792GB/s
  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length    50                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length    50                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length    50
  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50
              SLS    cache not flushed       prefetch 4     b/w  0.240228 GB/s     effective b/w:          1.09818GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.239071 GB/s     effective b/w:           1.0929GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.213773 GB/s     effective b/w:         0.977248GB/s
              SLS        cache flushed       prefetch 4     b/w  0.144547 GB/s     effective b/w:         0.660788GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.143015 GB/s     effective b/w:         0.653782GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.143064 GB/s     effective b/w:         0.654009GB/s
  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length    50                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length    50                         |  bit_rate     4batch size     1  num rows         6000000   emb dim    24      avg length    50
  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50                                                                        |  64 bit indices with prefetching, lengths_sum 50
              SLS    cache not flushed       prefetch 4     b/w  0.271859 GB/s     effective b/w:          1.08744GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.236553 GB/s     effective b/w:         0.946214GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.271416 GB/s     effective b/w:          1.08567GB/s
              SLS        cache flushed       prefetch 4     b/w  0.175296 GB/s     effective b/w:         0.701185GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.172102 GB/s     effective b/w:         0.688409GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.167294 GB/s     effective b/w:         0.669176GB/s
  bit_rate     4batch size     1  num rows         4000000   emb dim    68      avg length   100                         |  bit_rate     4batch size     1  num rows         4000000   emb dim    68      avg length   100                         |  bit_rate     4batch size     1  num rows         4000000   emb dim    68      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w  0.169403 GB/s     effective b/w:         0.285311GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.180288 GB/s     effective b/w:         0.303643GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.177599 GB/s     effective b/w:         0.299114GB/s
              SLS        cache flushed       prefetch 4     b/w  0.126442 GB/s     effective b/w:         0.212955GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.134837 GB/s     effective b/w:         0.227094GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.130753 GB/s     effective b/w:         0.220216GB/s
  bit_rate     4batch size    16  num rows        10000000   emb dim    28      avg length   100                         |  bit_rate     4batch size    16  num rows        10000000   emb dim    28      avg length   100                         |  bit_rate     4batch size    16  num rows        10000000   emb dim    28      avg length   100
  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600
              SLS    cache not flushed       prefetch 4     b/w  0.152165 GB/s     effective b/w:         0.541032GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.153995 GB/s     effective b/w:         0.547538GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.157132 GB/s     effective b/w:         0.558692GB/s
              SLS        cache flushed       prefetch 4     b/w  0.144367 GB/s     effective b/w:         0.513307GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.151356 GB/s     effective b/w:         0.538156GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.152418 GB/s     effective b/w:         0.541932GB/s
  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length   100                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length   100                         |  bit_rate     4batch size     1  num rows         2000000   emb dim    20      avg length   100
  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100                                                                       |  64 bit indices with prefetching, lengths_sum 100
              SLS    cache not flushed       prefetch 4     b/w  0.221365 GB/s     effective b/w:          1.01195GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.242639 GB/s     effective b/w:           1.1092GB/s   |              SLS    cache not flushed       prefetch 16     b/w   0.24001 GB/s     effective b/w:          1.09719GB/s
              SLS        cache flushed       prefetch 4     b/w  0.141794 GB/s     effective b/w:           0.6482GB/s   |              SLS        cache flushed       prefetch 8     b/w   0.13777 GB/s     effective b/w:         0.629803GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.142048 GB/s     effective b/w:         0.649364GB/s
  bit_rate     4batch size    40  num rows         5000000   emb dim    12      avg length    40                         |  bit_rate     4batch size    40  num rows         5000000   emb dim    12      avg length    40                         |  bit_rate     4batch size    40  num rows         5000000   emb dim    12      avg length    40
  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600                                                                      |  64 bit indices with prefetching, lengths_sum 1600
              SLS    cache not flushed       prefetch 4     b/w  0.175974 GB/s     effective b/w:          1.12623GB/s   |              SLS    cache not flushed       prefetch 8     b/w  0.157427 GB/s     effective b/w:          1.00754GB/s   |              SLS    cache not flushed       prefetch 16     b/w  0.175214 GB/s     effective b/w:          1.12137GB/s
              SLS        cache flushed       prefetch 4     b/w  0.160466 GB/s     effective b/w:          1.02699GB/s   |              SLS        cache flushed       prefetch 8     b/w  0.152678 GB/s     effective b/w:          0.97714GB/s   |              SLS        cache flushed       prefetch 16     b/w  0.168301 GB/s     effective b/w:          1.07712GB/s
```

Reviewed By: jianyuh

Differential Revision: D20799658

fbshipit-source-id: cd486d1bac56662de54960237d5fd8e3e9ba6822
2020-04-04 11:40:05 -07:00
7b2b17f727 Revert D20802884: [Shape Inference] Set new shape according to precedence of dimType over previous value
Test Plan: revert-hammer

Differential Revision:
D20802884

Original commit changeset: f8bdab5d5a8c

fbshipit-source-id: d44ffd273f2cdd582bc7306e7628ab09ab106ff2
2020-04-04 11:28:39 -07:00
82087ee7f6 Add DICT_CONSTRUCT and NAMED_TUPLE_CONSTRUCT to lite interpreter (#36015)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36015

Test Plan: Imported from OSS

Reviewed By: linbinyu

Differential Revision: D20853995

Pulled By: iseeyuan

fbshipit-source-id: 153f76d223f9ffc71e2259b741a7e5d78ae63f22
2020-04-04 09:52:58 -07:00
5fab1bf3e4 Use std::abs instead of abs in lbfgs.cpp (#35974)
Summary:
This supersedes https://github.com/pytorch/pytorch/pull/35698.

`abs` is a C-style function that takes only an integral argument;
`std::abs` is overloaded and can be applied to both integral and floating-point types.

This PR also increases `kBatchSize` in `test_optimizer_xor` function in `test/cpp/api/optim.cpp` to fix `OptimTest.XORConvergence_LBFGS` failure under ASAN.
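A small standalone example of the difference (my illustration, not code from the PR):

```
#include <cmath>
#include <iostream>

int main() {
  double x = -0.75;
  std::cout << std::abs(x) << "\n";  // 0.75: std::abs has a double overload
  // With C-style abs, the argument may be converted to int first (so
  // abs(-0.75) yields 0) on implementations that only declare int abs(int);
  // whether a global abs(double) exists is implementation-dependent, which
  // is why std::abs is the portable choice.
  return 0;
}
```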
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35974

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D20853570

Pulled By: yf225

fbshipit-source-id: 6135588df2426c5b974e4e097b416955d1907bd4
2020-04-04 09:37:21 -07:00
e3e2dd7779 [Shape Inference] Set new shape according to precedence of dimType over previous value (#35910)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35910

Reviewed By: iroot900

Differential Revision: D20802884

fbshipit-source-id: f8bdab5d5a8c7f71d564d18bd630425cb9f27c76
2020-04-03 23:30:37 -07:00
b7f4b6a6de Support for XNNPACK max pooling operator. (#35354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35354

Differential Revision: D20821862

Test Plan: Imported from OSS

Pulled By: AshkanAliabadi

fbshipit-source-id: 156fb8db85ab194919f68fd99599f08f2647b695
2020-04-03 22:53:15 -07:00
beb9430ff6 Propagate input tensor names in XNNPACK backend. (#35351)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35351

Differential Revision: D20821861

Test Plan: Imported from OSS

Pulled By: AshkanAliabadi

fbshipit-source-id: 68bc50a0e87572f4d5388961ae83138852e69249
2020-04-03 22:51:38 -07:00
a604041a11 Back out "[pytorch][PR] indexing: throw exception for masks with dtype=uint8" (#36013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36013

Original commit changeset: f4ebaabf427d

Test Plan: CI

Differential Revision: D20853694

fbshipit-source-id: 93deb43f67a385ddfd6853fef6f1dc6de408ec37
2020-04-03 21:40:02 -07:00
de04a1850f Remove nonexistent op variable in complex tests. (#35722)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35722

Differential Revision: D20851653

Pulled By: EscapeZero

fbshipit-source-id: 42f23f4150d0bf501e0e03f0802dd3e3c5fa60f5
2020-04-03 19:57:55 -07:00
e5c6003f3e Mark prim::rpc_async as having side effect (#35994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35994

prim::rpc_async was optimized out if we don't take its returned future and wait on it.

Test Plan:

Differential Revision: D7850846

fbshipit-source-id: e4e46506ab608f2e072027d6c10c49a4d784b14a
2020-04-03 18:12:50 -07:00
e73ab30f3d rand() and uniform_() for complex dtype (#35585)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35585

Test Plan: Imported from OSS

Differential Revision: D20820378

Pulled By: pbelevich

fbshipit-source-id: 761a274042ff44b46720339f34017974bf796e63
2020-04-03 18:05:24 -07:00
6be9c77998 Revert D20783179: [pytorch][PR] Bazel build of pytorch
Test Plan: revert-hammer

Differential Revision:
D20783179

Original commit changeset: b160908a3e10

fbshipit-source-id: 5b7b36305525e7ccc49540b48991149cf0a759f4
2020-04-03 17:59:10 -07:00
fced0c9837 Fix ATen/test/complex_test logic (#35976)
Summary:
Properly implement complex number arithmetic, for example `a / (b + c*i) = a * (b - c*i)/(b^2 + c^2)`
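
A quick standalone sanity check of that identity with `std::complex` (not the test code itself):

```cpp
#include <complex>
#include <iostream>

int main() {
  // Check a / (b + c*i) == a * (b - c*i) / (b^2 + c^2)
  double a = 3.0, b = 1.0, c = 2.0;
  std::complex<double> lhs = a / std::complex<double>(b, c);
  std::complex<double> rhs =
      a * std::complex<double>(b, -c) / (b * b + c * c);
  std::cout << lhs << " == " << rhs << '\n';  // both print (0.6,-1.2)
  return 0;
}
```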
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35976

Test Plan: CI

Differential Revision: D20851452

Pulled By: malfet

fbshipit-source-id: dd5d0fbc0b27c4ccfa66a8e75b97791188efc78a
2020-04-03 17:57:43 -07:00
eba5bdbeaa [pytorch] register c10 ops for static dispatch (#35193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35193

PR #34275 / D20266240 caused size regression.
PR #35148 / D20578316 reverted partially to fix the regression.
With buck selective build landed it should no longer cause size regression. This diff relands the reverted part of the original diff.
ghstack-source-id: 100641910

Test Plan: CI

Differential Revision: D20586305

fbshipit-source-id: 6f314d6c13d1a557b314123a5ca350ab88441e95
2020-04-03 17:52:42 -07:00
585f153d00 Bazel build of pytorch (#35220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35220

Reviewed By: seemethere

Differential Revision: D20783179

Pulled By: malfet

fbshipit-source-id: b160908a3e107790fa06057a77de9d6d23493bbc
2020-04-03 17:13:58 -07:00
4b64dffcb6 Move uniform_() to DistributionTemplates(Migrate uniform_ from TH to ATen) (#35580)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35580

`uniform_kernel_cpu` is based on https://github.com/pytorch/pytorch/pull/30954

Test Plan: Imported from OSS

Differential Revision: D20820221

Pulled By: pbelevich

fbshipit-source-id: 13f9fc8fc75b0e9fb48021f2ac08dcb38212a53f
2020-04-03 16:37:44 -07:00
4d5fe90046 [rpc] replace tests on worker_name (#35955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35955

bulk replace worker{} with worker_name()

Test Plan: Imported from OSS

Differential Revision: D20849012

Pulled By: wanchaol

fbshipit-source-id: 52ab1439c9dbfe814b576a97b0689331f1ff0274
2020-04-03 16:32:30 -07:00
ccfcf47531 Calls to Tensor::to pass MemoryFormat by TensorOptions (#34249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34249

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20834164

Pulled By: bhosmer

fbshipit-source-id: 67586512df6b30869a8a77149fde6ff27beab81e
2020-04-03 16:28:17 -07:00
d559a47933 Enable relu fusion with prepacked linear/conv. (#35705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35705

Introduces a pass for relu fusion.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20746592

fbshipit-source-id: 6c22f60a20e9121618c85077b9b58fb8d4082b3b
2020-04-03 15:38:45 -07:00
eb42199788 third_party: bump fbgemm to 0bb23bf9 (#35988)
Summary:
Looks like the branch was force-pushed; let's update this to a commit that exists

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35988

Differential Revision: D20849115

Pulled By: seemethere

fbshipit-source-id: 2f1202dcddef834d0b75a46e1202aa30b0176ac9
2020-04-03 15:33:46 -07:00
a054d05707 Add torch.utils.show_pickle for showing pickle contents in saved models (#35168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35168

Sometimes when a saved model isn't working, it's nice to be able to look
at the contents of the pickle files.  Unfortunately, pickletools output
isn't particularly readable, and unpickling is often either not possible
or runs so much post-processing code that it's not possible to tell
exactly what is present in the pickled data.

This script uses a custom Unpickler to unpickle (almost) any data into
stub objects that have no dependency on torch or any other runtime types
and suppress (almost) any postprocessing code.

As a convenience, the wrapper can search through zip files, supporting
command lines like

`python -m torch.utils.show_pickle /path/to/model.pt1@*/data.pkl`

When the module is invoked as main, we also install a hack in pprint to
allow semi-reasonable formatting of our stub objects.

Test Plan: Ran it on a data.pkl, constants.pkl, and a debug pkl

Differential Revision: D20842550

Pulled By: dreiss

fbshipit-source-id: ef662d8915fc5795039054d1f8fef2e1c51cf40a
2020-04-03 15:11:20 -07:00
ec3b355a0f Update ostream << TensorOptions printer. (#35892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35892

A couple recent property additions were missing, plus we weren't
distinguishing between defaults and bona fide property values.

Test Plan: Imported from OSS

Differential Revision: D20834147

Pulled By: bhosmer

fbshipit-source-id: 26a7e433414e0cde1eee2a9a67472f03ba970897
2020-04-03 15:01:59 -07:00
03a4a4887d Fix clang-format (#35969)
Summary:
Just run `./tools/clang_format.py  --verbose` and `git commit --all`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35969

Test Plan: CI

Differential Revision: D20845626

Pulled By: malfet

fbshipit-source-id: 0ae9a91dfa33417a021e7e9d233baba4188daf81
2020-04-03 14:36:20 -07:00
71669f0249 Fix flake8 (#35968)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35968

Pulled By: driazati

Differential Revision: D20845617

fbshipit-source-id: 1b1cedb9c5c721f7f7edf94b91fbbb97d249bc2a
2020-04-03 14:02:37 -07:00
7b04772c51 Keep same autogenerated files structure between fbcode and OSS builds (#35951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35951

Change generate_code to keep the folder structure the same regardless of whether an install path is provided.
Amend build_variables.bzl accordingly.

Another preliminary step to merge https://github.com/pytorch/pytorch/pull/35220

Test Plan: CI

Reviewed By: EscapeZero, seemethere

Differential Revision: D20839410

fbshipit-source-id: 02297560a7e48aa7c6271f7a8517fc4a1ab35271
2020-04-03 12:28:07 -07:00
ba3cec867f Reenable test/test_tensorexpr.py (#35914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35914

Test Plan: Imported from OSS

Differential Revision: D20827188

Pulled By: ZolotukhinM

fbshipit-source-id: ffcc1bb0396a0a19afb577a7ab4ca95c7e4ced37
2020-04-03 12:20:31 -07:00
af5121f62a Invoke TensorExpr fuser pass from a graph executor. (#35913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35913

The pass itself is still disabled by default, but with this change we
don't need to register it as a custom pass anymore. It allows us to
control its behavior with env variables more easily.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D20827189

Pulled By: ZolotukhinM

fbshipit-source-id: e74d90b5e46422e7ab7bc40974a805220da50fbc
2020-04-03 12:20:26 -07:00
b3d30f2dc4 [TensorExpr] Compiler warnings cleanups. (#35925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35925

Test Plan: Imported from OSS

Differential Revision: D20830304

Pulled By: ZolotukhinM

fbshipit-source-id: 5ecd8fd403a3222385306a5295a199c86c88a6cc
2020-04-03 12:18:42 -07:00
b46fddf506 idtt + zch distributed inference (#35763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35763

Adds inference function and test for ScatterAssign

Test Plan: Updated unit test

Reviewed By: yyetim, shunting1986

Differential Revision: D20501079

fbshipit-source-id: 7ec6ef0127a151250dd699c90c2b80c35cfb1fe4
2020-04-03 12:09:34 -07:00
596153cad1 [jit] Enable type tags in serialization (#35741)
Summary:
This enables the serialization part of this change (the deserialization stuff is already landed #33255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35741

Pulled By: driazati

Differential Revision: D20758124

fbshipit-source-id: e2cdefa99c3bec991491e5e967e7f1661ca7ffd9
2020-04-03 11:59:42 -07:00
19bbfbe1cf [RPC][Better Engineering] Consolidated all rpcAgentRunning atomic booleans (#33915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33915

Closes: https://github.com/pytorch/pytorch/issues/32963

Test Plan: build bot

Reviewed By: jjlilley

Differential Revision: D20074714

fbshipit-source-id: ee89e76f547a1da71825a317c096176524504290
2020-04-03 11:50:05 -07:00
c5c63a2e35 Add quick utility to transform scripted/traced models for mobile. (#35904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35904

Currently this optimization transforms conv2d and linear ops to their
prepacked (XNNPACK) equivalents.

Test Plan: buck run fbsource//xplat/caffe2:optimize_for_mobile -- --model="/tmp/inpainting_fbnet.pt"

Reviewed By: AshkanAliabadi

Differential Revision: D20824433

fbshipit-source-id: 88d5c0d21b77911f95f018b03398b0df758ab0d7
2020-04-03 11:42:11 -07:00
f48008c261 Set eval mode during optimization for mobile. (#35903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35903

Eval mode must be set for module freezing, which is required for prepack
folding.

Test Plan: Test locally by transforming a model. As shown in the diff above this one.

Reviewed By: AshkanAliabadi

Differential Revision: D20824420

fbshipit-source-id: 6c226f44cca317b0333fb580ebbfd060128ae919
2020-04-03 11:39:37 -07:00
6e13a7787b [jit] Fix type comparisons segfault (#35929)
Summary:
Pybind will convert `None`s to `nullptr`s, so this adds a check to make
sure those don't get into the actual type comparison logic. Fixes #35778
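
A minimal sketch of that kind of guard (types and names illustrative, not the actual PyTorch code):

```cpp
#include <memory>

struct Type {
  virtual ~Type() = default;
  virtual bool equals(const Type& other) const = 0;
};
using TypePtr = std::shared_ptr<Type>;

// Python None arrives through pybind as a null TypePtr, so reject
// nulls before the comparison logic dereferences anything.
bool typesMatch(const TypePtr& a, const TypePtr& b) {
  if (a == nullptr || b == nullptr) {
    return false;
  }
  return a->equals(*b);
}
```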
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35929

Pulled By: driazati

Differential Revision: D20831278

fbshipit-source-id: 5800050e5eec280072afde58141ad00c1e8db8e2
2020-04-03 11:33:48 -07:00
2fa3c1570d Refactor C++ API parity test mechanism and turn it on in CI again (#35190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35190

The following are the main changes:
- The main logic of C++ API parity test mechanism is moved from `test/test_cpp_api_parity.py` to `test/cpp_api_parity/module_impl_check.py` and `test/cpp_api_parity/functional_impl_check.py`, so that there is a clear separation between module tests and functional tests, although they still share a lot of common utility functions which are all in `test/cpp_api_parity/utils.py`.
- Module init tests (i.e. testing whether C++ module accepts the same constructor options as the corresponding Python module) is removed and will be added again in the future.
- `cpp_constructor_args` / `cpp_options_args` / `cpp_function_call` are added as appropriate to all test params dict in `torch/testing/_internal/common_nn.py`, to indicate how to run C++ API parity test for this test params dict.

Test Plan: Imported from OSS

Differential Revision: D20588198

Pulled By: yf225

fbshipit-source-id: 11238c560c8247129584b9b49df73fff40c4d81d
2020-04-03 11:20:36 -07:00
2d8dbcd3ef Remove python2 and 3.5 from requirements.txt, README and docs (#35677)
Summary:
Some more cleanup now that we no longer support python2 or 3.5 on master and, eventually, the PyTorch 1.6 release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35677

Differential Revision: D20838097

Pulled By: orionr

fbshipit-source-id: 95d553a1e8769f3baa395e0bc6d4ce7cd93236e9
2020-04-03 11:05:43 -07:00
7468ef04c2 [quant][graphmode] Add quantize_per_tensor.tensors (#35916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35916

quantize_per_tensor can now accept a list of tensors.
This is needed for operators like LSTM and cat.

Test Plan: Imported from OSS

Differential Revision: D20830388

fbshipit-source-id: 73f81cf6b7c7614ef19a73b721bc57cf33211345
2020-04-03 10:42:59 -07:00
f0c747243c [quant][graphmode] Insert Observers for dynamic LSTM (#35894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35894

Insert the new TensorListObserver only for the weight input of dynamic LSTM.
This is because we are currently not observing the activation inputs in graph mode;
activation tensors are dynamically quantized within the aten::qlinear_dynamic op.

Test Plan:
python test/quantization/test_quantize_script.py

Imported from OSS

Differential Revision: D20830387

fbshipit-source-id: 81bd197ee509df41bd7622ed09fa3f199a37573b
2020-04-03 10:42:54 -07:00
0429d2c9b8 [quant][graphmode] Add new tensorlist observer for LSTM (#35893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35893

LSTM operator inputs include tensor lists for activations and weights.
In graph mode we need a new observer that works with tensor lists.

Test Plan:
python test/quantization/test_quantization.py ObserverTest

Imported from OSS

Differential Revision: D20830389

fbshipit-source-id: 4790f8932ae3d38446c1d942a2b3780aa91e3022
2020-04-03 10:41:28 -07:00
87582ae6c4 Make RRef type_hint mismatch exception message more actionable to users (#35943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35943

This change adds a message explaining why the concrete Module type is not a subtype of the Interface type, by naming the missing method. For example, users may have forgotten to tag that method with torch.jit.export.

Test Plan:

Differential Revision: D7993693

fbshipit-source-id: 1a5b1d9ef483e5e120ab53c2427586560fbb9bcd
2020-04-03 10:25:09 -07:00
ea8021d726 Make intdiv_256 a more generic binary operator template (#35422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35422

This would make `intdiv_256` a much more generic template that can easily accommodate other types of binary operators in the future. The operator becomes "out-of-place" because this would make it easier to substitute with other operators, and compilers should have no problem optimizing this.

Test Plan: Imported from OSS

Differential Revision: D20826861

Pulled By: ngimel

fbshipit-source-id: a6d0706cc1a585063426e988d9982bad402a9b36
2020-04-03 10:08:07 -07:00
a1cf3fd1da lshift and rshift on CUDA should match the behavior on CPU (#35339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35339

The CPU version converts integers to their unsigned counterparts first. The CUDA
version should do the same.

Also added tests for this.
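
A sketch of the convert-to-unsigned pattern (illustrative host code, not the kernels; pre-C++20, shifting negative signed values is implementation-defined or undefined):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  int8_t x = -128;  // bit pattern 0x80
  // Reinterpret the bits as unsigned, shift, cast back, so both
  // backends get the same well-defined logical-shift result:
  uint8_t ux = static_cast<uint8_t>(x);
  int8_t logical = static_cast<int8_t>(ux >> 1);  // 0x40 == 64
  // A plain arithmetic shift on the signed value gives -64 instead.
  std::printf("%d vs %d\n", logical, x >> 1);
  return 0;
}
```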

Test Plan: Imported from OSS

Differential Revision: D20826862

Pulled By: ngimel

fbshipit-source-id: 164c84cfd931d8c57177a038c1bb8b6f73134d07
2020-04-03 10:08:03 -07:00
beac3f27f0 Make intdiv_256 a more generic binary operator template (#35422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35422

This would make `intdiv_256` a much more generic template that can easily accommodate other types of binary operators in the future. The operator becomes "out-of-place" because this would make it easier to substitute with other operators, and compilers should have no problem optimizing this.

Test Plan: Imported from OSS

Differential Revision: D20824641

Pulled By: ngimel

fbshipit-source-id: ec93f7b23eb7196f3791f4d07092ce12c254b6e0
2020-04-03 10:06:40 -07:00
e707cee501 Fix gcc-5.4 compilation (#35935)
Summary:
It needs a hint on how to hash an `enum class` used as a `std::unordered_map` key
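
A standalone illustration of the workaround, assuming the hint takes the usual form of an explicit hash functor:

```cpp
#include <cstddef>
#include <unordered_map>

enum class Backend { CPU, CUDA };

// Older compilers such as gcc 5.4 lack the library fix (LWG 2148)
// that makes std::hash work for enum types out of the box, so
// supply the hash explicitly as the map's third template argument.
struct BackendHash {
  std::size_t operator()(Backend b) const noexcept {
    return static_cast<std::size_t>(b);
  }
};

std::unordered_map<Backend, int, BackendHash> counters{
    {Backend::CPU, 0}, {Backend::CUDA, 0}};
```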
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35935

Test Plan: CI

Differential Revision: D20837750

Pulled By: malfet

fbshipit-source-id: 4208ee4bfa2e3cfbedf5b92bf18031225bf9dfa1
2020-04-03 08:39:30 -07:00
be125d18dd [ROCm] [ROCm 2.10+] enable fp16 dot in Caffe2 backend (#30432)
Summary:
ROCm 2.10 has a hdot implementation, use it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30432

Differential Revision: D20777482

Pulled By: ezyang

fbshipit-source-id: b4826cc399faa08bd83047375283b17bcd2477eb
2020-04-03 08:01:23 -07:00
8484ca581e Add GRAPH_UPDATE for x.size() in Peephole Optimize (#34865)
Summary:
Fix https://github.com/pytorch/pytorch/issues/31820
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34865

Reviewed By: jamesr66a

Differential Revision: D20772078

Pulled By: suo

fbshipit-source-id: cddf870e23983cc42da898edf3f98897353b2abe
2020-04-03 01:05:14 -07:00
aeb13f212b Make ValType hashable. (#35917)
Summary:
Build fix stemming from https://github.com/pytorch/pytorch/issues/34785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35917

Differential Revision: D20829353

Pulled By: soumith

fbshipit-source-id: 4ba84ecedd354efbc9ac47c9b0f0e3871b404f13
2020-04-03 00:16:56 -07:00
1a146b0577 Vec256<bfloat16>::arange step size should accept templates. (#35842)
Summary:
See https://github.com/pytorch/pytorch/issues/34555
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35842

Differential Revision: D20824448

Pulled By: ngimel

fbshipit-source-id: d76d0d499cfd102386931e4c029829e02a657bce
2020-04-02 23:39:58 -07:00
a5af478f29 Use full include path in autogenerated Functions.cpp (#35924)
Summary:
Preliminary step to merge https://github.com/pytorch/pytorch/pull/35220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35924

Test Plan: CI

Differential Revision: D20832159

Pulled By: malfet

fbshipit-source-id: 29ff2e3c04c08c39c49f35414f94b76f0651859a
2020-04-02 22:46:09 -07:00
d0ce94d20e Avoid one unnecessary memory allocation in XNNPACK integration. (#35350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35350

Currently we call input.contiguous() on the input tensor resulting in an
unnecessary allocation and copy in cases where the input is not contiguous
with regard to the requested memory format.  The reason is that in such
scenarios, this call re-allocates and copies the input tensor into
contiguous storage, only for this newly allocated tensor to be used as
the source of another copy to the final destination.  Instead, if we copy
into the destination directly in such circumstances, we will save an
allocation and a copy.

Differential Revision: D20656798

Test Plan: Imported from OSS

Pulled By: AshkanAliabadi

fbshipit-source-id: 3f8c51df4d1fd386fa9473e7024621a7b7c6e86c
2020-04-02 21:33:30 -07:00
c33ea41f9c Fixes a bug in serializing min/max plus one more. (#35850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35850

1. Clamping values were not being propagated through all the structures
and hence were not being serialized.
2. Moved to using Scalar for min/max instead of float. The reason is that the
fusion for hardtanh_ does not work otherwise: during subgraph rewrite we direct
values from hardtanh_ to prepacking ops, but since those expect float
values, the types conflict and we cannot serialize the model.

Test Plan: Imported from OSS

Differential Revision: D20807523

fbshipit-source-id: 57d6b2e4b65afd9510a0f3ba9365333b768977f5
2020-04-02 21:02:12 -07:00
e2adcc1c53 Report CUDA separate compilation flag (#35726)
Summary:
In the build summary, specify whether CUDA code is compiled with separate compilation enabled.

Also, correctly handle space-separated TORCH_NVCC_FLAGS when adding them to NVCC_CUDA_FLAGS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35726

Test Plan: CI + local build with TORCH_NVCC_FLAGS set to "-Xfatbin -compress-all"

Differential Revision: D20830885

Pulled By: malfet

fbshipit-source-id: 0e0ecab4a97b6c8662a2c4bfc817857da9f32201
2020-04-02 19:35:02 -07:00
767ea03b22 Clear profiling information timely and appropriately (#35814)
Summary:
Clear profiling information before it gets used by passes before guard insertion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35814

Differential Revision: D20800599

Pulled By: Krovatkin

fbshipit-source-id: 978d71c22e1880dc888e7e75e7c25501c573333f
2020-04-02 19:14:56 -07:00
1a72326942 .circleci: Bump libtorch builds to 3.7 (#35912)
Summary:
The image is actually using Python 3.7.2 so we should reflect that
within our circleci configs

Should fix any issues related to `libtorch*gcc5_4` jobs.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35912

Reviewed By: orionr

Differential Revision: D20827149

Pulled By: seemethere

fbshipit-source-id: 72917b35f6d176ce1f5bc999d6808b9f1d9944f2
2020-04-02 17:09:05 -07:00
591b5da2c8 Removes integer division call sites (#35862)
Summary:
Per title. Tests of integer division are unchanged.

The intent of this PR is to eliminate warning noise as users see our integer div deprecation warning and try to update their programs to be conformant. In particular, some CUDA indexing operations could perform a deprecated integer division, possibly confusing users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35862

Differential Revision: D20817957

Pulled By: mruberry

fbshipit-source-id: b9fa15922c9bcea3cb08c0402ea2515feec137c9
2020-04-02 17:04:15 -07:00
ced1e46399 [PTM] register aten::dequantize.self for spark spot int8 model
Summary: aten::dequantize.self is the only missing op in spark spot int8 model

Test Plan: same as D20761873

Reviewed By: iseeyuan

Differential Revision: D20785654

fbshipit-source-id: 19a3394370af58012ed0dedcc458f3633d921527
2020-04-02 16:57:59 -07:00
f5b9574887 [easy] ThroughputBenchmark: make ScriptModuleBenchmark usable from c++ (#35848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35848

This class was so far used from the Python binding only. As a result, testing in a C++-only environment is not currently possible. More specifically, adding inputs requires using
py::args and py::kwargs. This PR fixes this by adding another addInput function to the ScriptModuleBenchmark class.

Test Plan: Imported from OSS

Differential Revision: D20820772

Pulled By: ilia-cher

fbshipit-source-id: f1ea1b7baa637b297cc0dec5ca6375f6caff21f5
2020-04-02 16:53:49 -07:00
762270c51f add c10d dynamic loading mechanism and unit test (#28068)
Summary:
The original behavior of pytorch c10d only supports built-in c10d backends, such as
nccl/gloo/mpi. This patch extends the c10d capability to support dynamically
loading third-party communication libraries that are derived from the ProcessGroup base class.

related RFC is in: https://github.com/pytorch/pytorch/issues/27955

This way, a user just needs to specify a third-party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. As for how to develop a new third-party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068

Differential Revision: D19174838

Pulled By: agolynski

fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
2020-04-02 15:46:51 -07:00
2a4ca70832 Fix contant/placeholder loading in checkGraphCompatibility
Summary: Properly load node inputs as placeholders during onnxifi checkGraphCompatibility only if they are non-weight inputs to the node

Test Plan:
`buck test glow:`
  PASS: 2286
  FAIL: 0
  SKIP: 456

Reviewed By: jfix71

Differential Revision: D20823088

fbshipit-source-id: 76215b2c0c3934e36714201c7e716e8f95463e6d
2020-04-02 15:36:07 -07:00
2595c62208 [JIT] Better error on default params error (#35888)
Summary:
Someone messaged me about this when a better error message would have solved their problem
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35888

Differential Revision: D20819538

Pulled By: eellison

fbshipit-source-id: 95d124bfd162e1747dcdf7a981703a279a5dfaa6
2020-04-02 15:31:22 -07:00
c070e8fb26 Updated canCast to disallow complex -> non complex conversion (#35883)
Summary:
fixes https://github.com/pytorch/pytorch/issues/35675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35883

Differential Revision: D20818130

Pulled By: anjali411

fbshipit-source-id: c9b4b6112897639d1e9b7073c5dac7a29b9cd990
2020-04-02 15:12:38 -07:00
dabeff33b9 [pytorch] Fix fblearner flow compiling errors (#35902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35902

Move operator registration to anonymous namespace to avoid collision.

Reviewed By: soumith

Differential Revision: D20822382

fbshipit-source-id: 1ab00871491668b8b85e803ac877d96477f1688b
2020-04-02 14:52:48 -07:00
ddcad5b9ca temp disable test_tensorexpr.py
Test Plan: test on CI

Reviewed By: soumith

Differential Revision: D20823336

fbshipit-source-id: 65c04bc57c6a120003cb561613645d2d7e60189c
2020-04-02 14:28:22 -07:00
15b711a654 Fix reporting of error message in toBool (#35570)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35570

Test Plan: Imported from OSS

Differential Revision: D20710690

Pulled By: zdevito

fbshipit-source-id: a83c687058c09943438f4c7e183754f931783fbc
2020-04-02 14:28:18 -07:00
bc7fdacf06 [BugFix] Fix compare_exchange_weak in DispatchStub.h (#35794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35794

### Summary

As PyTorch has been in production on iOS for about a week, we've spotted a few crashes (90 out of 20.3k) related to DispatchStub.h. The major part of the crash log is pasted below (full crash information can be found at `bunnylol logview 1d285dc9172c877b679d0f8539da58f0`):

```
FBCameraFramework void at::native::DispatchStub<void (*)(at::TensorIterator&, c10::Scalar), at::native::add_stub>::operator()<at::TensorIterator&, c10::Scalar&>(c10::DeviceType, at::TensorIterator&, c10::Scalar&)(DispatchStub.h:0)
+FBCameraFramework at::native::add(at::Tensor const&, at::Tensor const&, c10::Scalar)(BinaryOps.cpp:53)
+FBCameraFramework at::CPUType::add_Tensor(at::Tensor const&, at::Tensor const&, c10::Scalar)(CPUType.cpp:55)
+FBCameraFramework at::add(at::Tensor const&, at::Tensor const&, c10::Scalar)(Functions.h:1805)
+FBCameraFramework [inlined] c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::intrusive_ptr(c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>&&)(intrusive_ptr.h:0)
+FBCameraFramework [inlined] c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::intrusive_ptr(c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>&&)(intrusive_ptr.h:221)
+FBCameraFramework [inlined] at::Tensor::Tensor(at::Tensor&&)(TensorBody.h:93)
+FBCameraFramework [inlined] at::Tensor::Tensor(at::Tensor&&)(TensorBody.h:93)
+FBCameraFramework c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >::operator()(at::Tensor, at::Tensor, c10::Scalar)(kernel_lambda.h:23)
+FBCameraFramework [inlined] c10::guts::infer_function_traits<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> > >::type::return_type c10::detail::call_functor_with_args_from_stack_<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >, false, 0ul, 1ul, 2ul>(c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*, std::__1::vector<c10::IValue, c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*::allocator<std::__1::vector> >*, c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*::integer_sequence<unsigned long, 0ul, 1ul, 2ul>)(kernel_functor.h:210)
+FBCameraFramework [inlined] c10::guts::infer_function_traits<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> > >::type::return_type c10::detail::call_functor_with_args_from_stack<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >, false>(c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*, std::__1::vector<c10::IValue, c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*::allocator<std::__1::vector> >*)(kernel_functor.h:218)
+FBCameraFramework c10::detail::make_boxed_from_unboxed_functor<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >, false, void>::call(c10::OperatorKernel*, c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*)(kernel_functor.h:250)
+FBCameraFramework [inlined] (anonymous namespace)::variable_fallback_kernel(c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*)(VariableFallbackKernel.cpp:32)
+FBCameraFramework void c10::KernelFunction::make_boxed_function<&((anonymous namespace)::variable_fallback_kernel(c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*))>(c10::OperatorKernel*, c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*)(KernelFunction_impl.h:21)
+FBCameraFramework torch::jit::mobile::InterpreterState::run(std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >&)(interpreter.cpp:0)
+FBCameraFramework torch::jit::mobile::Function::run(std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >&) const(function.cpp:59)
+FBCameraFramework torch::jit::mobile::Module::run_method(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >)(module.cpp:51)
+FBCameraFramework [inlined] torch::jit::mobile::Module::forward(std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >)(module.h:28)
```
The problem is that `compare_exchange_weak` is not guaranteed to succeed in one shot, as described in [C++ Concurrency in Action (2nd Edition)](https://livebook.manning.com/book/c-plus-plus-concurrency-in-action-second-edition/chapter-5/79). This might result in `cpu_dispatch_ptr` being a null pointer in concurrent situations, thus leading to the crash. As suggested in the book, due to spurious failure, `compare_exchange_weak` is typically used in a loop. There is also a [stackoverflow discussion](https://stackoverflow.com/questions/25199838/understanding-stdatomiccompare-exchange-weak-in-c11) about this. Feel free to drop comments below if there is a better option.
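
A minimal sketch of the loop form (illustrative names, not the actual DispatchStub code):

```cpp
#include <atomic>

using KernelFn = void (*)();

// compare_exchange_weak may fail spuriously even when the stored
// value equals `expected`, so retry in a loop instead of assuming
// a single call either installs the pointer or observes a winner.
KernelFn install_or_get(std::atomic<KernelFn>& cached, KernelFn chosen) {
  KernelFn expected = nullptr;
  while (!cached.compare_exchange_weak(expected, chosen)) {
    if (expected != nullptr) {
      return expected;  // another thread installed a pointer first
    }
    // Spurious failure: `expected` is still nullptr, try again.
  }
  return chosen;
}
```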

### The original PR

- [Enhance DispatchStub to be thread safe from a TSAN point of view](https://github.com/pytorch/pytorch/pull/32148)

### Test Plan

- Keep observing the crash reports in QE

Test Plan: Imported from OSS

Differential Revision: D20808751

Pulled By: xta0

fbshipit-source-id: 52f5c865b70c59b332ef9f0865315e76d97f6eaa
2020-04-02 14:26:48 -07:00
d9dd353a00 fix clang-format (#35884)
Summary:
breakage introduced in PR that I landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35884

Differential Revision: D20817603

Pulled By: soumith

fbshipit-source-id: b0729bed81549d4c8e6a889c380baa19c73ef127
2020-04-02 12:12:27 -07:00
09660896c0 Break circular dependency between ATen.h and TensorIndexing.h (#35765)
Summary:
This is mostly just so VS Code will stop yelling at me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35765

Differential Revision: D20787435

Pulled By: robieta

fbshipit-source-id: c8173399328e6da60a07bfcb4b62e91f7f4fe34a
2020-04-02 12:03:15 -07:00
a53328e89c cmake: Grab TORCH_DEFAULT_VERSION from version.txt (#35260)
Summary:
This variable hasn't been updated in a long time since it usually just
gets overwritten by whatever is in the setup.py, but let's set the
default to something a bit more in line with what we're actually
building.

Closes https://github.com/pytorch/pytorch/issues/35210

cc ksasso1028

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35260

Differential Revision: D20818302

Pulled By: seemethere

fbshipit-source-id: 530fe137e45be1d0ac0233525c80f7099c17b05a
2020-04-02 11:57:47 -07:00
9097b55479 Propagate static_if more completely. (#35834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35834

This handles the cases we did not handle before in AND and OR statements:

    static_true || <unknown> -> static_true
    static_false && <unknown> -> static_false
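
Expressed as three-valued logic, the handling looks roughly like this (a sketch, not the actual IR code):

```cpp
enum class Tri { True, False, Unknown };

Tri static_or(Tri a, Tri b) {
  if (a == Tri::True || b == Tri::True) return Tri::True;    // new rule
  if (a == Tri::False && b == Tri::False) return Tri::False;
  return Tri::Unknown;
}

Tri static_and(Tri a, Tri b) {
  if (a == Tri::False || b == Tri::False) return Tri::False; // new rule
  if (a == Tri::True && b == Tri::True) return Tri::True;
  return Tri::Unknown;
}
```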

Test Plan: Imported from OSS

Differential Revision: D20801125

Pulled By: zdevito

fbshipit-source-id: 0ef94c3a14c7af91580fc5248a4ccfd9e8d6d481
2020-04-02 11:44:34 -07:00
173e444e66 track ddp API usage (#35837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35837

track ddp API usage
ghstack-source-id: 101344592

Test Plan: check logging

Differential Revision: D20801983

fbshipit-source-id: f3d2e6ab5bde0b1320300d67a327f12ddb2c62e7
2020-04-02 11:35:05 -07:00
602b51eb30 Changes to qadd for perf improvement.
Summary:
qadd calls contiguous on its input tensors. By default this produces NCHW
format (for 4D tensors); we should call
.contiguous(input.suggest_memory_format()) instead.
Output allocation is also done in NCHW format, which results in the subsequent conv
having to do a memcpy for NHWC format.
Together these cause the majority of the time spent in qadd to go to copying in the FBNET_A
model.
Fixing these reduces runtime on an S8 phone from 17 ms to 15 ms, reducing the gap
between c2 and PT latency from ~24% to ~9.5%.
Also note that the contract for ops is that they return the output tensor in the same
memory format as the input.
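
A sketch of the recommended pattern in ATen terms (illustrative, not the exact diff):

```cpp
#include <ATen/ATen.h>

// Respect the input's memory format instead of defaulting to NCHW:
// for an NHWC (ChannelsLast) input this is a no-op rather than a
// full reorder, and the output keeps the input's format.
at::Tensor prep_for_qadd(const at::Tensor& input) {
  auto fmt = input.suggest_memory_format();
  return input.contiguous(fmt);  // no copy if already contiguous in fmt
}
```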

Test Plan:
Apply on top of diff D20721889.
bento console --file mobile-vision/projects/model_zoo/scripts/run_create_model_benchmark.py

Note: There are many calls to .contiguous without format specification in
aten/src/ATen/native/quantized/cpu.
All those should be replaced with .contiguous(input.suggest_memory_format())
whenever applicable (most likely applicable to all elementwise ops).
The same should apply to output allocation.

Reviewed By: dreiss

Differential Revision: D20794692

fbshipit-source-id: 6b81012497721d48e7d6a5efcc402f315b1dfe77
2020-04-02 11:33:18 -07:00
3ef5ff6012 [TensorExpr] Make Load and Store multi-dimensional. (#35800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35800

This PR includes the following changes:
* Introduce a new `Expr` type `Buf`: it plays a role similar to `Var`, but also has dimensions.
* Use the new `Buf` class in `Store` and `Load` instead of `Var` for specifying where to store to or load from. `Buf` contains the dimensions info of the buffer we're loading/storing to and hence we are able to keep N-d indexes without flattening them into a 1-d index ([x,y] vs [x+y*W]).
* Flattening of the indexes is now a separate pass that is executed in `LoopNest::prepareForCodegen` - backends still expect indexes to be flattened, and this PR preserves that.
* `Tensor` now contains a `Buf` instead of `Var`, and thus Tensor now has the dimensions info (previously it was a property of a `Function`, not a `Tensor`). This brings us closer to Tensor being a combination of Buffer + Function, where Buffer specifies iteration domain and the Function defines a computation.

TODOs:
* Consider merging `Buffer` with `Buf` or `BufHandle`. It seems that we don't need all of them.
* Harden the logic of how we create buffers in fuser pass. Currently it seems that sometimes we don't set dimensions.
* Use `Buf` in `Allocate` and `Free`.
* Make it clearer that `Function` doesn't "own" dimensions info and that dimensions are a property of a Tensor, not a Function.

Differential Revision: D20789005

Test Plan: Imported from OSS

Reviewed By: zheng-xq

Pulled By: ZolotukhinM

fbshipit-source-id: e04188d1d297f195f1c46669c614557d6bb6cde4
2020-04-02 11:18:28 -07:00
676fc929b7 [caffe2] fix type and shape inference for common gradient ops (#35857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35857

This fixes a lot of common ops for InferBlobShapesAndTypes as well as adds support for testing the inferred shapes and types of gradient ops.

Ops:
* Concat
* Split
* LeakyReLU
* Relu
* Prelu
* Gelu
* Elu
* Sinh, Tanh, Cosh
* Abs
* ... and a number of other simple element wise ops

Test Plan:
Added support to hypothesis test to check the shape and type of gradient ops.

Enabled it for all the ops I fixed the shape and type inference for.

  buck test caffe2/caffe2/python/operator_test:

Reviewed By: pradeepd24

Differential Revision: D20806284

fbshipit-source-id: 77f796d9ff208e09e871bdbadf9a0a7c196b77f2
2020-04-02 11:17:04 -07:00
c4f56e9685 [pytorch][PR] Optimize qavg_pool3d_nhwc (#35740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35740

For one of the quantized CV models, the avg_pool3d operation is more than 6x slower than the C2 implementation. The reasons behind this come from the following aspects:
- function access inside the loop (such as ```q_scale()``` and ```q_zero_point()```)
- additional data copy in ```Vec256::store``` and ```at::quantize_vec```

This diff resolves the above issue with the following measures:
- lift function access outside the loops (see the sketch below)
- add an 8-lane path in ```QuantizeAvx2``` to replace ```at::quantize_vec```
- in addition, interchange the c-loop to the innermost position for better memory locality.
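
As an illustration of the first measure, a hedged sketch of the hoisting (the dequantize loop itself is illustrative, not the actual kernel code):

```cpp
#include <ATen/ATen.h>
#include <vector>

std::vector<float> dequantize_all(const at::Tensor& qx) {
  // Hoist the accessors out of the hot loop: read them once
  // instead of calling qx.q_scale() / qx.q_zero_point() per element.
  const double scale = qx.q_scale();
  const int64_t zero_point = qx.q_zero_point();
  const int64_t n = qx.numel();
  const auto* data = qx.data_ptr<c10::quint8>();
  std::vector<float> out(n);
  for (int64_t i = 0; i < n; ++i) {
    out[i] = static_cast<float>(
        (static_cast<int64_t>(data[i].val_) - zero_point) * scale);
  }
  return out;
}
```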

Test Plan:
buck test //caffe2/test:quantized

Performance Before (n x h x w x c = 4 x 56 x 56 x ch):
```
type            c=2             c=4             c=15            c=24            c=48            c=128           c=256
torch.qint8     903.08 us       1373.39 us      2297.97 us      636.72 us       864.98 us       1618.72 us      2908.47 us
torch.quint8    911.93 us       1429.39 us      2315.59 us      623.08 us       844.17 us       1522.28 us      2711.08 us
torch.qint32    897.77 us       1346.97 us      3846.41 us      6211.92 us      11977.74 us     34348.23 us     62927.48 us
```
Performance After:
```
type            c=2             c=4             c=15            c=24            c=48            c=128           c=256
torch.qint8     123.29 us       176.00 us       348.90 us       99.02 us        132.73 us       267.17 us       513.43 us
torch.quint8    123.76 us       171.90 us       338.17 us       97.92 us        131.06 us       260.09 us       521.16 us
torch.qint32    102.97 us       172.57 us       559.31 us       814.03 us       1606.11 us      4164.89 us      10041.52 us
```

Reviewed By: lly-zero-one

Differential Revision: D20711888

fbshipit-source-id: a71dd55639500f4a036eee96c357737cff9d33db
2020-04-02 11:12:24 -07:00
0f99b28431 Revert D20775783: Add DispatchKey impl overload; remove use of torch::dispatch
Test Plan: revert-hammer

Differential Revision:
D20775783

Original commit changeset: e45b289e5d1f

fbshipit-source-id: 08551428fa886e93cfda14eb51a0f920c335df34
2020-04-02 10:51:50 -07:00
e67951af63 Revert D20775782: Add temporary impl_UNBOXED syntax sugar for unboxed-only defs.
Test Plan: revert-hammer

Differential Revision:
D20775782

Original commit changeset: c5e804c69f59

fbshipit-source-id: 2198e715eb3a24d198a949a44ec192bec523ffb4
2020-04-02 10:48:51 -07:00
86f3305859 Improve C++ API autograd and indexing docs (#35777)
Summary:
This PR adds docs for the following components:
1. Tensor autograd APIs (such as `is_leaf` / `backward` / `detach` / `detach_` / `retain_grad` / `grad` / `register_hook` / `remove_hook`)
2. Autograd APIs: `torch::autograd::backward` / `grad` / `Function` / `AutogradContext`, `torch::NoGradGuard` / `torch::AutoGradMode`
3. Tensor indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35777

Differential Revision: D20810616

Pulled By: yf225

fbshipit-source-id: 60526ec0c5b051021901d89bc3b56861c68758e8
2020-04-02 09:33:11 -07:00
6d24f8fe21 Infrastructure for a new CUDA Fuser (#34785)
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, but the implementation is built from the ground up. The fusion pass itself is similar to the default CUDA fuser; however, it has undergone some refactoring and is using the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_. One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". TensorView should, from a high level, be thought of similarly to how we think of working with Tensors in PyTorch: it's an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM; they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.

**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.

**Short term goals:**

Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout

**Mid-term goals:**

- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785

Reviewed By: ZolotukhinM

Differential Revision: D20650977

Pulled By: soumith

fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
2020-04-02 09:22:42 -07:00
8e951c5793 Add temporary impl_UNBOXED syntax sugar for unboxed-only defs. (#35714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35714

There are a lot of unboxed-only defs.  We're committed to removing
them at the end of the half but as I am about to do a lot of porting
to the new API, let's get them into a form where they're easy to
remove.  This is a new overload impl_UNBOXED that will pass
the function pointer straight to CppFunction::makeUnboxedOnly.

I don't attempt to make the _UNBOXED API complete; in particular,
catchall declarations don't get this sugar (as there are very few
of them).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20775782

Pulled By: ezyang

fbshipit-source-id: c5e804c69f5961c9d4862f6c5dbbe4c524cc32cc
2020-04-02 08:52:54 -07:00
2db61193bb Add DispatchKey impl overload; remove use of torch::dispatch (#35706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35706

It is extremely common to define implementations of operators at a
specific dispatch key, so we add an overload to impl specifically for
this case.  I then delete most uses of torch::dispatch

dispatch_autograd call sites can't make use of this overload.  So
instead the new preferred way to specify something as autograd is to
pass kAutograd as the dispatch key (short form, analogous to kCPU/kCUDA
which we support today).

I flip-flopped about whether or not kAutograd should have the type
DispatchKey or some other type (to help better encapsulate the
DispatchKey enum); this is more direct and I can't think of any
BC problems from this usage.

Some other reorganization I did:
- I renamed all of the worker functions in op_registration to have
  a leading underscore and made them private, just to make it more
  clear what the public versus private API were (the private API
  shouldn't be used by users because it doesn't come with && overloads)
- In a few places where I was touching lines already, I replaced
  full DispatchKey typed out enums with shorter kFoo names, similar
  to kAutograd but I didn't publish these globally.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20775783

Pulled By: ezyang

fbshipit-source-id: e45b289e5d1f86c180b24cf14c63cf4459ab5337
2020-04-02 08:51:22 -07:00
c3abcf83aa [AI Bench] Resumme speed_benchmark_torch.cc to origin
Summary: we removed all assistant specific code

Test Plan:
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android/full_jit --framework pytorch --remote --devices  SM-G950U-7.0-24
```

https://our.intern.facebook.com/intern/aibench/details/940147322057842

Reviewed By: kimishpatel

Differential Revision: D20686220

fbshipit-source-id: b7336d5ea15fa11be01abf4ad12747feaaf22ea8
2020-04-02 08:35:46 -07:00
1bd68eafb5 Skip ROCm test in test/test_cpp_extensions_aot.py (#35838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35838

It may be flaky.

Test Plan: Imported from OSS

Differential Revision: D20807409

Pulled By: gchanan

fbshipit-source-id: f085d05bcb6a04d304f3cd048c38d2e8453125d6
2020-04-02 08:28:46 -07:00
051132f119 [TensorExpr] simplification of round + mod pattern. (#35683)
Summary:
Adds capabilities to the TensorExpr IR Simplifier to simplify Round + Mod patterns (e.g. `(x/y)*y + x%y => x`) by lifting integer rounding into a temporary `RoundOff` node.

This integrates with existing simplification mechanisms (folding, factorization, reordering, etc) to allow simplification of compound expressions: e.g. `20 * (x  / (16 / 2)) * 2 + (11 % 6) * (x % (7+1)) => 5 * x.`.
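
The underlying arithmetic fact is a guarantee of C++ truncating integer division; a tiny standalone check:

```cpp
#include <cassert>

int main() {
  for (int x = -40; x <= 40; ++x) {
    for (int y : {1, 2, 7, 16}) {
      // The RoundOff identity the simplifier relies on:
      assert((x / y) * y + x % y == x);
    }
  }
  return 0;
}
```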

Tests: ran tensorexpr cpp and python tests, ran a hpc benchmark and verified results and time didn't regress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35683

Differential Revision: D20811316

Pulled By: nickgg

fbshipit-source-id: 0cd6a517fb9548b3bc689768304b97375df5ac58
2020-04-02 00:11:00 -07:00
2f50c11954 add test_tensorexpr.py (#35776)
Summary:
Adding `test_tensorexpr.py` to our CI. There are a few complications. The first is that we now always run `SimpleIREval` as part of the simplifier, so the counts will always be greater than one. We could invest some effort into differentiating between a real codegen call to `SimpleIREval` and calls made by the simplifier, but it's probably not that important. The second change turns a failure to retrieve a counter into a default value of 0: the tests are structured to test for either the LLVM or the SimpleIREval backend, so it seems appropriate not to fail the test too early.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35776

Differential Revision: D20799333

Pulled By: Krovatkin

fbshipit-source-id: 2a94ff98e647180c6e6aea141a411c3376c509f9
2020-04-01 22:05:37 -07:00
6616fad92e [Docs] Fix typo in RPC docs (#35809)
Summary:
It's also fixed in the cherry pick PR https://github.com/pytorch/pytorch/pull/35808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35809

Differential Revision: D20803338

Pulled By: rohan-varma

fbshipit-source-id: 1925f367703faf053ab4b1c0ff0acb86230c5d89
2020-04-01 21:16:12 -07:00
b33ae23c5a Revert D20794765: [pytorch][PR] Improve C++ API autograd and indexing docs
Test Plan: revert-hammer

Differential Revision:
D20794765

Original commit changeset: fad623e5d505

fbshipit-source-id: 041fb7257d4978a3767d8229d70d6f3cc55e5f28
2020-04-01 20:14:13 -07:00
6792dac90d Only Schedule Retries before Agent Shutdown (#35554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35554

We attach a callback to our RPC send attempts that schedule a retry
upon failure. This PR only schedules the retry if the agent is running.
ghstack-source-id: 101332815

Differential Revision: D20612615

fbshipit-source-id: e1bbb3f162101bce7eb46bad512c9e5dc6d531cc
2020-04-01 19:03:09 -07:00
b3c0939af3 [quant][graphmode][refactor] Move the whitelists to a centeralized place (#35721)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35721

Test Plan:
.

Imported from OSS

Differential Revision: D20771829

fbshipit-source-id: f6ec3afe2d8034acbdbd81e5a6fbd4a2a76aa7ac
2020-04-01 18:26:39 -07:00
e372f42110 [caffe2] Explicit vectorization of LSTM operator (#35556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35542

Apply explicit vectorization to the lstm_unit operator.
Enabled by -DENABLE_VECTORIZATION=1

This optimization requires vector library support and was tested with Intel SVML & clang.
However, compilers that support OpenMP 4.5 with the omp simd extension should also benefit.
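
A sketch of what such an annotation looks like (VECTOR_LOOP is shown here expanding to an OpenMP simd pragma, mirroring the loops quoted below; the real macro and loop bodies live in the caffe2 sources):

```cpp
#if defined(ENABLE_VECTORIZATION)
#define VECTOR_LOOP _Pragma("omp simd")
#else
#define VECTOR_LOOP
#endif

void axpy(int D, float a, const float* x, float* y) {
  // Loops of this shape are what the -Rpass=loop-vectorize
  // remarks below report as vectorized.
  VECTOR_LOOP for (int d = 0; d < D; ++d) {
    y[d] += a * x[d];
  }
}
```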

After the code changes
In file included from caffe2/caffe2/operators/lstm_unit_op.cc:1:
caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {

caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
caffe2/caffe2/operators/lstm_unit_op.h:112:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {

Test Plan:
Check failures at OSS CI
- No build failures related to this change
- Failing tests are:
  - py3.6-clang7-rocmdeb-ubuntu16.04-test2
>RuntimeError: fft: ATen not compiled with MKL support
  - caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test -
>gradient_check_test.py::TestMakeTwo
Exited with code exit status 1
- pytorch_macos_10_13_py3_test , Test errors like:
> ERROR [0.014s]: test_boolean_indexing_weirdness_cpu (__main__.NumpyTestsCPU)
RuntimeError: shape mismatch: indexing tensors could not be broadcast together with shapes [0], [2]
- caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test
  - No failure info

Reviewed By: jspark1105

Differential Revision: D20484640

fbshipit-source-id: 8fb82dbd6698c8de3e0bbbc0b48d15b70e36ca94
2020-04-01 17:19:56 -07:00
e0ee8000ac Make test_leaky_relu_inplace_with_neg_slope device-generic and skipIfRocm. (#35816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35816

Fixes https://github.com/pytorch/pytorch/issues/35689.

Test Plan: Imported from OSS

Differential Revision: D20796656

Pulled By: gchanan

fbshipit-source-id: 474790fe07899d9944644f6b3d7a15db1c2b96db
2020-04-01 17:02:04 -07:00
41ef2c0d58 Improve C++ API autograd and indexing docs (#35777)
Summary:
This PR adds docs for the following components:
1. Tensor autograd APIs (such as `is_leaf` / `backward` / `detach` / `detach_` / `retain_grad` / `grad` / `register_hook` / `remove_hook`)
2. Autograd APIs: `torch::autograd::backward` / `grad` / `Function` / `AutogradContext`, `torch::NoGradGuard` / `torch::AutoGradMode`
3. Tensor indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35777

Differential Revision: D20794765

Pulled By: yf225

fbshipit-source-id: fad623e5d505b7cfcd76a8c5264f18b7a0a3298c
2020-04-01 16:54:08 -07:00
301be851ef Fix grid_sample out of boundary when grid contains large numbers (#35506)
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/35202, fix GPU part of https://github.com/pytorch/pytorch/issues/24823, be related to https://github.com/pytorch/pytorch/issues/24870.

Here is the origin of this problem.
1. Like those in https://github.com/pytorch/pytorch/issues/35202, with large numbers in grid like `grid.min() == -10059144 grid.max()==67680944`; or `nan, inf, 1.0E20` in https://github.com/pytorch/pytorch/issues/24823,
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cu (L309-L321)
`ix, iy` will be unnormalized to very large numbers, exceeding the bound of INT_MAX.
Then, those `ix_nw, iy_nw` variables will be cast to INT_MAX, and some other variables with "+1" will be INT_MIN.

2. However, these INT_MAX, INT_MIN values should not be big problems, because
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cu (L358-L362)
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cuh (L202-L205)
these `within_bounds_2d` functions are supposed to guard the if-statement, prevent the illegal memory access, and leave those output values as zero (padding_modes='zeros').

3. Now here comes the problem: `within_bounds_2d` is marked "inline". We found that the `+1` and `>=0` statements may cause the compiler to "optimize" the code, that is:
```cpp
int B = something;

int a = something;
int b = a + 1;
bool r = (b >= 0 && b < B);
```
will be compiled into assembly code like
```cpp
int B = something;

int a = something;
bool r1 = (a > -2)
int b = a + 1;
bool r2 = (b < B);
bool r = r1 && r2;
```
This looks nice, but when a = INT_MAX, `a+1` causes Undefined Behavior. Typically, we get b = INT_MIN, and then the boolean result from the compiled assembly will be true, so `within_bounds_2d` no longer guards us from the illegal memory access.

4. There could be different ways to fix this bug. For example, we may set all of the "ix_nw, iy_nw" values to `int64_t`. That would be a potential performance issue, and it doesn't prevent those examples in https://github.com/pytorch/pytorch/issues/24823 with 1E20 in the grid.

One minimal fix that I found is to prevent `within_bounds_2d` from being inlined. Thus, the compiler won't optimize the `a+1` and `a>=0` code together.
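
A sketch of the shape of the fix (illustrative; the actual device function lives in GridSampler.cuh):

```cpp
// Forcing the bounds check out-of-line keeps the compiler from
// fusing the caller's `+1` (which may overflow) with the `>= 0`
// test inside the check.
__attribute__((noinline))
static bool within_bounds_2d(int h, int w, int H, int W) {
  return h >= 0 && h < H && w >= 0 && w < W;
}
```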

I did a short performance test, just to make sure this forced noinline solution won't cause a regression. The performance script can be found at
a6f8bce522/grid-sample/grid-sample.ipynb.

For this `__attribute__((noinline))` macro, I have tested that on nvcc, and there was no problem. I'm not sure if that also works on clang.

cc csarofeen ptrblck ngimel bnehoran zasdfgbnm SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35506

Differential Revision: D20799304

Pulled By: ngimel

fbshipit-source-id: fc70289b35039fad954908a990ab0a2f16fbfcb2
2020-04-01 14:38:30 -07:00
16774f7353 Increase TimerTest tolerance to 20% on Windows (#35818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35818

Test Plan: CI

Differential Revision: D20798424

Pulled By: malfet

fbshipit-source-id: 57e8d9c6b93903a6632168a4a35bf946d8c518aa
2020-04-01 14:29:05 -07:00
9fe3b1857d [TensorExpr] Fix imports in tensorexpr benchmarks. (#35830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35830

Test Plan: Imported from OSS

Differential Revision: D20799464

Pulled By: ZolotukhinM

fbshipit-source-id: 1b5981ad15042f601a9b6eb01a799cdf71200666
2020-04-01 14:23:33 -07:00
1c93a19a7f Fix another case of float2::x and float2::y may not be the same on ROCm (#35785)
Summary:
This is another case of the issue fixed in https://github.com/pytorch/pytorch/pull/35783. Mirroring 35786.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35785

Differential Revision: D20800317

Pulled By: ezyang

fbshipit-source-id: de5f32839755d5ff5aefff8408df69adbab4d0a1
2020-04-01 14:14:30 -07:00
50b0bb6c6a Updating submodules
Summary:
GitHub commits:

be34fbe8a4
c5bc292372
a09ba2acd7
c1beec58f4
a643e68d6d
2da6546f44
57096ab13e
be56f1c78e
204dff9f76
79103e7664
dba77af4fd
03c4c1bf82
896dffc48f
815e209e4f

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 63fad0d8a163a7b5f0107c6b5642cb227f73a2ae
2020-04-01 13:55:49 -07:00
26ee0eee10 Use cufft_static_nocallback (#35813)
Summary:
Hattip to ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35813

Test Plan: CI

Differential Revision: D20800789

Pulled By: malfet

fbshipit-source-id: a51cedfc7dfc68ac59d4f00f12eaff43cf1fdd7a
2020-04-01 13:43:49 -07:00
5d1205bf02 Suppress output when checking hipcc (#35789)
Summary:
Otherwise, it prints an error message when hipcc is not found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35789

Differential Revision: D20793089

Pulled By: ezyang

fbshipit-source-id: 4b3cb29fb1d74a1931603ee01e669013ccae9685
2020-04-01 13:03:21 -07:00
16a88e4369 Add unboxedCallRedispatch (#35476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35476

A few things:
- Add a new callUnboxedRedispatch function, which can be used to
  redispatch when you don't want to add a type id to the excluded
  set.  This recomputes the dispatch key but ignores everything
  up to and including the currentDispatchKey.
- Add a FULL_AFTER constructor to DispatchKeySet; used to implement
  redispatch.  A sketch of the masking it enables is below.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D20680518

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: ecd7fbdfa916d0d2550a5b19dd3ee4a9f2272457
2020-04-01 12:48:33 -07:00
ab26dfb44e [quant] Move quantization tests into test/quantization (#35812)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35812

Test Plan:
.

Imported from OSS

Differential Revision: D20795329

fbshipit-source-id: 42cc905c44ce7b86720aeef512d747ff6788d7a2
2020-04-01 12:44:19 -07:00
15326fb240 Revert "Attempt to fix windows build" (#35217)
Summary:
This reverts commit 0c222555ce82f2caf497e2fea2f2844bdd67e9e5.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35217

Differential Revision: D20793032

Pulled By: ezyang

fbshipit-source-id: 66132f6007db2932aafcbdb09d89101cb944bab1
2020-04-01 12:42:44 -07:00
990b54146f Revert D16864196: [pytorch][PR] port fmod from TH to ATen
Test Plan: revert-hammer

Differential Revision:
D16864196

Original commit changeset: d884cc9e74bb

fbshipit-source-id: d52a4ae715698c92d5878c2b6876cd98dc80b5ac
2020-04-01 12:34:17 -07:00
35cdb78522 Make kl_div accept target in log space (#34586)
Summary:
Fixes [32520](https://github.com/pytorch/pytorch/issues/32520), implements [34536](https://github.com/pytorch/pytorch/issues/34536).

Here are some benchmarks:
```python
import torch
import torch.nn.functional as F
from IPython import get_ipython

ipython = get_ipython()

torch.set_num_threads(1)

for d in [5, 10, 20, 50, 100, 1000]:
    i = torch.rand(d, d)
    t = torch.rand(d, d)
    print(f"Size: {d}x{d}")
    ipython.magic("timeit F.kl_div(i, t, reduction='none', log_target=False)")
    ipython.magic("timeit F.kl_div(i, t.log(), reduction='none', log_target=True)")
```
Output:
```
Size: 5x5
16 µs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.24 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 10x10
16.7 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.7 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 20x20
17.7 µs ± 47.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.7 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 50x50
23.6 µs ± 60.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
15 µs ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 100x100
42.8 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Size: 1000x1000
3.9 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.45 ms ± 364 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

```
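
A minimal sketch of what the new `log_target` flag means, following the benchmark above (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

i = torch.rand(4, 4)
t = torch.rand(4, 4)

# The two calls are mathematically equivalent; the log-space path skips
# recomputing t.log() inside the kernel, which is where the speedup in
# the numbers above comes from.
a = F.kl_div(i, t, reduction='none', log_target=False)
b = F.kl_div(i, t.log(), reduction='none', log_target=True)
assert torch.allclose(a, b)
```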
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34586

Differential Revision: D20652726

Pulled By: ezyang

fbshipit-source-id: 480697b4cd01341bbeee7514a8b812705a0600ea
2020-04-01 12:26:58 -07:00
74ef0adf60 add mv operator to SparseTensor (#21782)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21266: add an mv operator to SparseTensor. A hedged usage sketch follows.
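
This assumes a 2-D sparse COO matrix times a dense vector, per the linked issue; the values are illustrative:
```python
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([3.0, 4.0])
m = torch.sparse_coo_tensor(i, v, (2, 2))  # [[0, 3], [4, 0]]
x = torch.tensor([1.0, 2.0])
print(torch.mv(m, x))  # tensor([6., 4.]), returned as a dense vector
```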
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21782

Differential Revision: D20794372

Pulled By: ezyang

fbshipit-source-id: 6b396357d512f7a5860da83e7976c33bf92cf974
2020-04-01 12:21:50 -07:00
2b068d10b0 Removing references to PYTHON3COMPATIMPORTS. (#35384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35384

Removing references to PYTHON3COMPATIMPORTS: mostly suppressions, but
also one instance of actual usage removed from a bash script.

Fixed the errors that arc lint uncovered.

Test Plan:
arc lint
Sandcastle tests

Reviewed By: zertosh

Differential Revision: D20635401

fbshipit-source-id: 74c6b5edb85a78a44f96b96f72ee75a9c2d029f1
2020-04-01 10:34:04 -07:00
acb59a3b86 Remove unused header in process_group_agent.h (#35767)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35767

Test Plan: Imported from OSS

Differential Revision: D20771900

Pulled By: mrshenli

fbshipit-source-id: af88abfabfbe3d2d94407942738f7dcdfc3f30e2
2020-04-01 10:28:41 -07:00
6491bf2855 Revert D20777341: [pytorch][PR] Add __torch_function__ benchmarks.
Test Plan: revert-hammer

Differential Revision:
D20777341

Original commit changeset: 6aaaf2a07553

fbshipit-source-id: 1c324f91f85ac624bf878297c96c682a46958954
2020-04-01 10:23:00 -07:00
e9d868a529 Kill CUDA_tensor_apply4 (#33998)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33998

Test Plan: Imported from OSS

Differential Revision: D20196787

Pulled By: VitalyFedyunin

fbshipit-source-id: 1978a014efb4a18ef9fcc7ad928a4264af4a297a
2020-04-01 10:07:47 -07:00
d463c10668 Migrate prelu_cuda_backward from CUDA_tensor_apply4 to TensorIterator (#33997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33997

Test Plan: Imported from OSS

Differential Revision: D20196786

Pulled By: VitalyFedyunin

fbshipit-source-id: 5d54a9226bd7f369f582192842be2fcd384bc7af
2020-04-01 10:06:16 -07:00
ceff21a4fc port fmod from TH to ATen (#24405)
Summary:
https://github.com/pytorch/pytorch/issues/22803

performance benchmarks:

    import timeit
    import torch
    import itertools
    import statistics

    def test_perf(sizes, device, repeat, times):
        def _tensor(name, sizes, device):
            return '''{0} = torch.rand({1}, device="{2}");'''.format(name, sizes, device)

        setup_code = 'import torch;' + _tensor('x', sizes, device) + _tensor('y', sizes, device)
        test_code = '''torch.fmod(y, x);'''
        if device == "cuda":
            test_code = test_code + 'torch.cuda.synchronize()'
        result = timeit.repeat(setup = setup_code,stmt = test_code,repeat = repeat,number = times)
        mean = statistics.mean(result)
        std = statistics.stdev(result)
        print('''sizes = {0} std = {1} mean = {2}'''.format(sizes, std, mean))

    def test_perf_for_device(device, small, mid, large):
        print(device)
        for s in itertools.product((small, mid, large), (small, mid, large)):
            test_perf(str(s), device, 3, 300)

    test_perf_for_device("cpu", 5, 100, 1000)
    test_perf_for_device("cuda", 5, 100, 10000)

pytorch:master

	cpu
	sizes = (5, 5) std = 0.0004191587896767566 mean = 0.0052408403377436725
	sizes = (5, 100) std = 0.00012129380478190695 mean = 0.006508304664748721
	sizes = (5, 1000) std = 0.00018175678335131663 mean = 0.0363664986701527
	sizes = (100, 5) std = 0.00034399426107962946 mean = 0.006770268999389373
	sizes = (100, 100) std = 0.0006779367543473553 mean = 0.07270567266580959
	sizes = (100, 1000) std = 0.01670362224705441 mean = 0.1300258070017056
	sizes = (1000, 5) std = 0.010281040640935534 mean = 0.045936293997025736
	sizes = (1000, 100) std = 0.012529932966256128 mean = 0.12733882099564653
	sizes = (1000, 1000) std = 0.002150238308503937 mean = 1.1608000710014796
	cuda
	sizes = (5, 5) std = 0.00016137550559233116 mean = 0.014315356330674453
	sizes = (5, 100) std = 0.0014720358192929545 mean = 0.015730336332732502
	sizes = (5, 10000) std = 0.0017510024071247026 mean = 0.015462367334597124
	sizes = (100, 5) std = 0.001569950832690219 mean = 0.015847195667447522
	sizes = (100, 100) std = 0.000935629392520788 mean = 0.015551854667137377
	sizes = (100, 10000) std = 0.002454919985869727 mean = 0.04476405966367262
	sizes = (10000, 5) std = 0.0013192075275361463 mean = 0.015794202001416124
	sizes = (10000, 100) std = 0.001418935833245521 mean = 0.04419450566638261
	sizes = (10000, 10000) std = 0.0070977799177425 mean = 3.267501967328523

shihongzhi:feature/port_fmod

	cpu
	sizes = (5, 5) std = 0.0003939277361171243 mean = 0.008732202996422226
	sizes = (5, 100) std = 7.568185896146914e-05 mean = 0.010897216998273507
	sizes = (5, 1000) std = 3.916722355255723e-05 mean = 0.03223436966557832
	sizes = (100, 5) std = 0.00016529833171236708 mean = 0.011018406672519632
	sizes = (100, 100) std = 0.000155446405937598 mean = 0.055315166668151505
	sizes = (100, 1000) std = 0.005295612670839835 mean = 0.09823771333321929
	sizes = (1000, 5) std = 5.087993715488194e-05 mean = 0.03315563267096877
	sizes = (1000, 100) std = 0.004952377745126246 mean = 0.09605619766807649
	sizes = (1000, 1000) std = 0.10362095898303665 mean = 0.9652185496718934
	cuda
	sizes = (5, 5) std = 5.004076916927963e-05 mean = 0.016851375335439418
	sizes = (5, 100) std = 0.0008912925390246038 mean = 0.01788881132476187
	sizes = (5, 10000) std = 0.0009701942336158022 mean = 0.018210363331794117
	sizes = (100, 5) std = 0.0007897575234315655 mean = 0.017682057005004026
	sizes = (100, 100) std = 0.0012395220098068511 mean = 0.016444508665396523
	sizes = (100, 10000) std = 0.000957364387413519 mean = 0.016943917328414198
	sizes = (10000, 5) std = 0.0011325899538680206 mean = 0.017102815332085203
	sizes = (10000, 100) std = 0.0013052748368152663 mean = 0.017058989333842572
	sizes = (10000, 10000) std = 0.024267574119715446 mean = 0.30735275766831666
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24405

Differential Revision: D16864196

Pulled By: VitalyFedyunin

fbshipit-source-id: d884cc9e74bb8f4ce2ad8d23c676fa914b26d8fb
2020-04-01 09:49:11 -07:00
a736b994b7 Remove old section of the aten doc that is not true anymore (#35807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35807

Differential Revision: D20794708

Pulled By: albanD

fbshipit-source-id: 8e67c369bd17f6527dd024c89fd2481ecd7f6ee1
2020-04-01 09:47:44 -07:00
945d7a7408 Add All-to-all comms support to distributed module and MPI backend (#32361)
Summary:
As described in https://github.com/pytorch/pytorch/issues/32345, this is a prototype implementation that adds an alltoall communication primitive to the torch.distributed module and the ProcessGroup abstract interface. It also implements alltoall in the ProcessGroupMPI backend.
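
A hedged usage sketch (run under mpirun with the MPI backend this PR implements; shapes are illustrative):
```python
import torch
import torch.distributed as dist

dist.init_process_group("mpi")
world = dist.get_world_size()
rank = dist.get_rank()

# Each rank contributes one chunk per peer and receives one chunk back
# from every peer.
inputs = [torch.full((2,), float(rank * 10 + i)) for i in range(world)]
outputs = [torch.empty(2) for _ in range(world)]
dist.all_to_all(outputs, inputs)
print(rank, outputs)  # outputs[i] holds the chunk rank i addressed to us
```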

mnaumovfb JianpingChen066 dmudiger srinivas212 Jianhui-Li mshiryaev ftian1

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini xush6528 osalpekar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32361

Reviewed By: mrshenli

Differential Revision: D20635481

Pulled By: srinivas212

fbshipit-source-id: 3dd0af800ce55d02f02813cde550e3a0f1a287d2
2020-04-01 08:57:12 -07:00
409bac48e4 Move all warn logic for overwriting registration to OperatorEntry (#35769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35769

This fixes a bug where correct end-user API usage could still trigger
a warning because we didn't preserve the invariants DispatchTable
previously expected.  So now, OperatorEntry is
the source of truth, and it just whacks DispatchTable until it's in
the correct state.  OperatorEntry does the user-facing checking.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20772383

Pulled By: ezyang

fbshipit-source-id: 167d249a826d7b02361ba0a44571813c829649c1
2020-04-01 08:17:46 -07:00
6318899c9b [ROCm] [ROCm 2.10+] enable fp16 dot in PyTorch backend (#30431)
Summary:
ROCm 2.10 has a hdot implementation. Use it and enable test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30431

Differential Revision: D20776784

Pulled By: ezyang

fbshipit-source-id: a192a701eb418dac2015e300563ade691c24903e
2020-04-01 07:49:13 -07:00
8c534bb0bd Add __torch_function__ benchmarks. (#35530)
Summary:
Since the last one was apparently reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35530

Differential Revision: D20777341

Pulled By: ezyang

fbshipit-source-id: 6aaaf2a0755359074ae3d0efe32018d78dafe976
2020-04-01 06:30:17 -07:00
60a3e82c4e [ONNX] Fix for constant folding: Slice, Added ReduceL1 and ReduceL2 (#35280)
Summary:
1- Added support for constant folding onnx::ReduceL1 and onnx::ReduceL2
2- Fixed constant folding for slice as onnx::Slice opset 11 supports negative axes and indices
3- Updated export of select opset 11
4- Separated test environment for test_utility_functions as environment variables could be overwritten by caffe2 quantization tests on CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35280

Reviewed By: hl475

Differential Revision: D20626140

Pulled By: houseroad

fbshipit-source-id: 39667c7852eeaa97d9da23f53da52760d3670ecf
2020-04-01 04:47:47 -07:00
1f06db2579 Refactored rpc docs (#35109)
Summary:
Reorganize as per jlin27 's comments. Screenshots added in comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35109

Differential Revision: D20788774

Pulled By: rohan-varma

fbshipit-source-id: 7d64be70ef76ed6ff303d05d39c338293c234766
2020-04-01 02:01:34 -07:00
a5bfcc5323 Unify management of thread local settings (#35523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35523

In this PR we extend ThreadLocalState to cover dispatch keys and
ThreadLocalDebugInfo and move it from JIT interpreter down to
thread management (at::launch) and autograd (backward threads) code

Test Plan: unit tests (CI)

Reviewed By: dzhulgakov

Differential Revision: D20615714

fbshipit-source-id: 16a9fc96a25cb6c2629230b1187fbf78786ac565
2020-04-01 01:56:39 -07:00
bc6bd0bb1a Debug Information Guard
Summary: This diff fixes the issues with current handling of debug information passed along the execution of the model. (For example, it is possible that multiple calls to the debug guard may override each other)

Test Plan: CI test/cpp/jit

Reviewed By: dzhulgakov

Differential Revision: D20602775

fbshipit-source-id: 4683957954028af81a1a0f1f12b243650230c9bb
2020-04-01 01:55:29 -07:00
fd1dfaa7d0 [jit] kill isSameIdentity (#35019)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35019

Test Plan: Imported from OSS

Differential Revision: D20537903

Pulled By: suo

fbshipit-source-id: 4610279f93c53dda30a8a555177f85edb73eea02
2020-04-01 01:47:17 -07:00
2d85daca58 [jit] kill shallowEquals (#35005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35005

This is one of the ad-hoc IValue equality implementations that should be
replaced with `operator==`.

Test Plan: Imported from OSS

Differential Revision: D20537900

Pulled By: suo

fbshipit-source-id: 5f31ee2386f9d0b33f2bc047a39351191f4d81b0
2020-04-01 01:47:12 -07:00
c382ec88d1 [jit] define equality for IValue (#34986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34986

Previously we were reluctant to define equality for IValues, as it's not
totally straightforward. But the vacuum this created basically forced
people to define their own equality comparisons for their own purposes.
We have at least 3 in PyTorch itself, and 2 others outside that I know
of.

These implementations are generally wrong, so we should just bite the
bullet and define equality canonically.

Test Plan: Imported from OSS

Differential Revision: D20537901

Pulled By: suo

fbshipit-source-id: 8d770a31bf6de6f3b38f9826bf898d62c0ccf34e
2020-04-01 01:46:00 -07:00
0ed3f881c5 clang-fmt (#35796)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35796

Test Plan: Imported from OSS

Reviewed By: shannonzhu

Differential Revision: D20788673

Pulled By: suo

fbshipit-source-id: 3555a6204ef174c28e561a8931e13814846813a3
2020-04-01 00:14:36 -07:00
866d9d4e6a [jit] Fix name collision on load (#35720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35720

When modules are saved, all relevant types are serialized according to
their qualified name with a compilation unit. Since qualified names are
guaranteed to be unique within a compilation unit, this normally works
fine.

On load, all types are registered in a compilation unit owned by the
script::Module. Type names are not unique across compilation units, so
if you load two modules with colliding type names, make them submodules
of yet another module, and save that module, there is the potential of a
name collision. See the added tests for examples if that description is
confusing.

The solution is to unique type names when serializing code by mangling
them if we detect a name collision.

Test Plan: Imported from OSS

Differential Revision: D20749423

Pulled By: suo

fbshipit-source-id: a8827ff1d4a89f3e7964dbbb49b4381863da3e6a
2020-04-01 00:02:38 -07:00
ee6f7c3e62 Remove extra semicolon (#35751)
Summary:
This suppresses lots of warnings during compilation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35751

Differential Revision: D20788302

Pulled By: jamesr66a

fbshipit-source-id: 9bc598ab27b87c28c3011597a39d695355cf4157
2020-03-31 23:33:17 -07:00
dc1ecdf8d9 Moves torch cpu math tests to device-generic framework (#35658)
Summary:
Per title. Also, replaces reference computation with `math.xx` functions and torch.apply_  with numpy/scipy as appropriate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35658

Differential Revision: D20744541

Pulled By: ngimel

fbshipit-source-id: a16ea506397c07f09f0d7f1c54fd8017418bd506
2020-03-31 23:28:38 -07:00
319aee1afb Revert D20771828: [quant] Move quantization tests into test/quantization
Test Plan: revert-hammer

Differential Revision:
D20771828

Original commit changeset: 5f1df5e86c29

fbshipit-source-id: d14f915f291ae8a90026c5b65624459211495f47
2020-03-31 23:01:00 -07:00
06dcb70905 [jit] Fix Type equality in some cases (#35719)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35719

Test Plan: Imported from OSS

Differential Revision: D20749422

Pulled By: suo

fbshipit-source-id: 09b697766c1eb3e56f4cf8acc7e854b0981d7991
2020-03-31 22:29:12 -07:00
51fb5ef80e [jit] add cast<> specialization for NamedType (#35718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35718

Because NamedType is not a concrete type (it's just an interface), it
has no corresponding TypeKind and thus no default `cast()` behavior.
Adding a specialization that does the right thing.

Test Plan: Imported from OSS

Differential Revision: D20749425

Pulled By: suo

fbshipit-source-id: 6ccab1cca26fd2b2805189fcf2305d99ae28145a
2020-03-31 22:29:07 -07:00
995f53b042 [jit] make python_str take a custom renamer (#35717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35717

We need to provide calling code the ability to customize how type names
are printed. Will be used to mangle names in python_print, stacked on
top.

Test Plan: Imported from OSS

Differential Revision: D20749424

Pulled By: suo

fbshipit-source-id: f110ab569c81e8934487295cd67009fc626ac194
2020-03-31 22:29:02 -07:00
6a5d008abf [jit] factor mangler out (#35716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35716

Test Plan: Imported from OSS

Differential Revision: D20749426

Pulled By: suo

fbshipit-source-id: ec148abf86ab17113d0c71a50375842f8a9ada0e
2020-03-31 22:27:39 -07:00
cae6bdf199 [JIT] Mark aten::wait as having side effect, since it can represent RPC message received (#35695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35695

aten::wait was optimized out, causing RPC futures to not be waited on.

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/model_parallel/tests:test_dist_optim
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_python_future_with_jit
```

```
buck build mode/dev-nosan //caffe2/test:jit && \
buck-out/gen/caffe2/test/jit\#binary.par -r test_trace_fork_wait_inline
```

```
buck build mode/dev-nosan //caffe2/test:jit && \
buck-out/gen/caffe2/test/jit\#binary.par -r test_trace_fork_wait_inline_onnx
```

Differential Revision: D9562716

fbshipit-source-id: 35b2c971efa42949ffdf0910bd75a927eee8d965
2020-03-31 22:17:25 -07:00
d1a4a64092 Disables imag for real-valued tensors (#35728)
Summary:
In NumPy, calling np.imag on a real-valued array returns a non-writable array (view) of zeros. In PyTorch we don't support non-writable tensors (or views), so we can either return a writable tensor or error.

If we do the former, that may confuse people who try to write to the imaginary part of a real-valued tensor, and it may cause a BC issue if we later support non-writable tensors. This PR errors to give us the flexibility to implement the solution we'd like in the future, while protecting users from unexpected behavior today.
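
A sketch of the new behavior, assuming `torch.imag` is the public entry point at this commit:
```python
import torch

t = torch.randn(3)  # real-valued tensor
try:
    torch.imag(t)   # now raises instead of returning writable zeros
except RuntimeError as e:
    print(e)
```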
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35728

Differential Revision: D20760687

Pulled By: mruberry

fbshipit-source-id: f60d445746cc75ba558804c853993d9e4621dad3
2020-03-31 21:34:46 -07:00
2f84a07b58 indexing: throw exception for masks with dtype=uint8 (#34418)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33751
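
A minimal sketch of the change:
```python
import torch

x = torch.arange(4)
mask = torch.tensor([1, 0, 1, 0], dtype=torch.uint8)
# x[mask]               # uint8 masks now raise (previously a deprecation warning)
print(x[mask.bool()])   # tensor([0, 2]); boolean masks are the supported path
```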
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34418

Differential Revision: D20776164

Pulled By: ngimel

fbshipit-source-id: f4ebaabf427d7967f2f317235562f91c8f9216f0
2020-03-31 20:51:56 -07:00
fef6c617d4 [quant] Move quantization tests into test/quantization (#35688)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35688

Test Plan:
.

Imported from OSS

Differential Revision: D20771828

fbshipit-source-id: 5f1df5e86c29f7bdfbdc6563450e909b3bfdc07a
2020-03-31 20:30:57 -07:00
1ec0676a33 [JIT] register list prim ops cleanup (#35768)
Summary:
This is a follow up from https://github.com/pytorch/pytorch/pull/34520, which removed specialized list ops. This removes templating from list ops.

It also has one other minor change: moving `aten::len(t[]) -> int` to `aten::len(Any[]) -> int` so that `len()` can be called on heterogeneous tuples (sketched below).
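
A hedged sketch of the case this enables:
```python
import torch

@torch.jit.script
def tuple_len(x: int) -> int:
    t = (x, "s", 2.0)  # heterogeneous tuple
    return len(t)      # now resolves via aten::len(Any[])

print(tuple_len(1))  # 3
```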
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35768

Differential Revision: D20772943

Pulled By: eellison

fbshipit-source-id: bc36a00920bc94ca8c5aa9eb7d5d7a640388ffbb
2020-03-31 19:24:59 -07:00
2c6d1e57cd is_complex doc fix (#35680)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35680

Differential Revision: D20740814

Pulled By: anjali411

fbshipit-source-id: dd35594ef7661a2876479b974b37be83cf472f44
2020-03-31 18:14:51 -07:00
3ba885896d [jit] Minor: in unpickler, string tweak in readBytes() (#35550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35550

Avoid clearing data before copying into the string buffer in a few cases.
ghstack-source-id: 101020139

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit/...

Differential Revision: D20699725

fbshipit-source-id: 14dce40dbebdd64fd0d60372cad1b642602205db
2020-03-31 17:55:14 -07:00
8d64a3848c [jit] In RPC Server, handle TorchScript continuations asynchronously (#34109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34109

This change adds glue to GraphExecutor to give the RPC server
access to the future-based Interpreter::runAsync() api.

Previously, if a server encountered a TorchScript continuation-based block
with fork/wait, it would simply block in the server thread until the handler
completed, since it uses the synchronous Interpreter::run() api.

With the ivalue::Future returned by the Interpreter, we can run the
TorchScript code asynchronously from c++ simply by connecting its
callback to the server callback.

We add test cases to cover the new logic, both rpc_async and remote.

ghstack-source-id: 101245438

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc/...

Differential Revision: D20194321

fbshipit-source-id: 16785ec5d9ed0b16cb1ffab0a9771a77de30fcb0
2020-03-31 17:21:46 -07:00
e5746eec1e [ROCm] Remove installation of ca-certificates and apt-transport-https in test.sh (#35676)
Summary:
These packages are now part of the base docker image.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35676

Differential Revision: D20777497

Pulled By: ezyang

fbshipit-source-id: aa9dba905dc376b1462910bc2c4a385d77d7aa0c
2020-03-31 15:36:24 -07:00
9650f465ce [quant][graphmode] Quantization support for at::sort (#35571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35571

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20769874

fbshipit-source-id: 7d6805754416fd9c4a3d84d42af756e1926111c2
2020-03-31 14:54:16 -07:00
8de01aac0b [Onnxifi] Add initializers to the C2 net passed into Glow (#35764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35764

So that Glow knows what input is constant.

We probably need to do similar things to torch_glow though.

Test Plan:
```
buck build caffe2/caffe2/opt/custom:glow_net_transform
```

Reviewed By: jackm321

Differential Revision: D20770514

fbshipit-source-id: d398eb8eddbdbba21ccb5b4ac9cb335e4b27b8b3
2020-03-31 14:45:05 -07:00
7d5350c2a3 [easy] ThroughputBenchmark: print out aten's parallel settings before execution (#35632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35632

This is handy to make sure the settings you have match your expectations. Here is an example output I got:

```
I0328 15:55:12.336715 41258 throughput_benchmark-inl.h:23] ATen/Parallel:
        at::get_num_threads() : 1
        at::get_num_interop_threads() : 14
OpenMP 201511 (a.k.a. OpenMP 4.5)
        omp_get_max_threads() : 1
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
        mkl_get_max_threads() : 1
Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
std::thread::hardware_concurrency() : 28
Environment variables:
        OMP_NUM_THREADS : 1
        MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
```

Test Plan: Imported from OSS

Differential Revision: D20731331

Pulled By: ezyang

fbshipit-source-id: 5be7ffb23db49b1771c2f563b5d84180c3a0ba7f
2020-03-31 14:25:29 -07:00
07dbf0db46 bfloat16: vectorized clamp, clamp_min and clmap_max (#35082)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35082

Test Plan: Imported from OSS

Differential Revision: D20721148

Pulled By: ngimel

fbshipit-source-id: 949b66e28bfc6049a891fced9ea308131b3675c6
2020-03-31 14:06:44 -07:00
8e49afa908 Updating submodules
Summary:
GitHub commits:

7235cf5630
faeae13a65
bec10cc357
849da725d3
9dafeb9e64
99dd5d7429
80979f81c7
90d929abd7
cb8e10a1af
99d7165530
70d8d13d0f
1798e56435

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: d40ae0d701bd41e30d30610cf381ac4fa2537947
2020-03-31 12:42:55 -07:00
063275fd33 Fix a bug in subgraph rewriters. (#35704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35704

Because nodes_to_delete_ was not cleared, writing a graph rewrite
pass with multiple patterns produced:
IndexError: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

Test Plan:
The PR stacked on top of this ran into this error in its unit test.

Imported from OSS

Differential Revision: D20746593

fbshipit-source-id: 9b55604f49ff2ee2a81a61827880cb679c44607a
2020-03-31 10:52:45 -07:00
f182b43760 [rref] Handle exceptions returned via remote() calls (#35331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35331

When the function called by remote() throws, it seems sensible to
surface that exception when rref.to_here() is called.

Doing this only involves simple modifications:
 - we need the OwnerRRef to keep around an optional<string>
   for the error
 - add an OwnerRRef setError() method that's parallel to setValue(),
   and plumb through the logic

We add rpc_tests to verify that the exception is propagated properly.
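
A hedged sketch of the new behavior (the worker name and function are hypothetical; assumes rpc.init_rpc has been called on every participant):
```python
import torch.distributed.rpc as rpc

def fail_on_owner():
    raise ValueError("raised on the owner")

rref = rpc.remote("worker1", fail_on_owner)
try:
    rref.to_here()  # the owner's exception now surfaces here
except Exception as e:
    print(e)        # ValueError: raised on the owner
```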
ghstack-source-id: 101136900

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn
  buck test mode/dev-nosan caffe2/test/distributed/rpc/jit:rpc_spawn

Differential Revision: D20634078

fbshipit-source-id: b5b13fdb85cdf6a43f42347d82eabae1635368ec
2020-03-31 10:06:15 -07:00
b4c4342747 hswish and hardsigmoid: improve docs (#35431)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35431

Resolving z-a-f's comments on earlier PRs on making
the docblocks easier to read.

Test Plan:
render the new docblocks in http://rst.aaroniles.net/

CI

Imported from OSS

Differential Revision: D20658668

fbshipit-source-id: 5ea4a21d6b8dc9d744e2f4ede2f9d5d799fb902f
2020-03-31 10:01:07 -07:00
3d6b5bac0a Move libtorch to py3 and cleanup other CircleCI config (#35700)
Summary:
This moves libtorch to Python 3.6 and cleans up other CircleCI config for the removal of python2.

Going to see if all tests pass on this and will also land before https://github.com/pytorch/pytorch/pull/35677
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35700

Differential Revision: D20767830

Pulled By: orionr

fbshipit-source-id: 0d5a8224b65829cc2b08a5844707e0c0e079421a
2020-03-31 09:39:41 -07:00
8ff05031b0 Update collect_env.py to detect relevant conda-installed numpy and cudatoolkit (#35646)
Summary:
Addresses a small issue noticed in gh-32369

With this PR in a clean conda env where `conda install pytorch torchvision cudatoolkit=10.1 -c pytorch` was run:

```
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.4.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.1.243             h6bb024c_0
[conda] mkl                       2020.0                      166
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.15           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] numpy                     1.18.1           py37h4f9e942_0
[conda] numpy-base                1.18.1           py37hde5b4d6_1
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchvision               0.5.0                py37_cu101    pytorch
```

With current master:
```
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.4.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2020.0                      166
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.15           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchvision               0.5.0                py37_cu101    pytorch
```

Note that the conda output will always also include pip-installed packages, so there are still duplicates now. If it's desirable to completely remove the `pip list` output for conda envs, that's also an option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35646

Differential Revision: D20736304

Pulled By: zou3519

fbshipit-source-id: fb6b3da3f69395869bc8c52bf7a85e9d15b0476d
2020-03-31 09:32:12 -07:00
ada647214f [caffe2] explicitly pass use_offsets=false when calling fbgemm embedding kernels (#35711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35711

As title

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20747290

fbshipit-source-id: fc9fced744cc8f0c61a671cb4b424ff067c2573d
2020-03-31 08:35:19 -07:00
81c2412721 [caffe2] Switch to using `public_include_directories
Summary:
caffe2 uses `-I` all over the place, but really we should use the Buck built-in version of this

Alternatively, the `exported_header` clean up means we need to standardize to a single path

Test Plan:
```
buck build caffe2:torch-cpp-cpu
buck build caffe2/...
```

Reviewed By: malfet

Differential Revision: D19150098

fbshipit-source-id: e99aaf69d6c474afaedbd5f693a7736d3d67aafc
2020-03-31 08:18:44 -07:00
d2343bea32 Disables complex floor, ceil, trunc (to be compatible with NumPy) (#35592)
Summary:
NumPy doesn't allow complex inputs to floor, ceil, or trunc, and without careful deliberation I don't think PyTorch should, either: is it intuitive that these functions apply to both the real and imaginary parts of complex tensors, or only to the real parts?

This PR disables these functions for complex inputs so we don't prematurely commit a particular behavior.
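
A short sketch of the resulting behavior, assuming complex tensor construction is available in this build:
```python
import torch

x = torch.tensor([1.5, -2.5])
print(torch.floor(x))           # real tensors: unchanged behavior
z = torch.tensor([1.5 + 2.5j])
# torch.floor(z)                # now raises, matching NumPy's refusal
```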
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35592

Differential Revision: D20757796

Pulled By: mruberry

fbshipit-source-id: fdc53ac161fca7ad94c9280c3f5cf9c7c40c7f2c
2020-03-31 08:09:09 -07:00
539d3ff344 Revert D20749588: [pytorch][PR] Use std::abs instead of abs in lbfgs.cpp
Test Plan: revert-hammer

Differential Revision:
D20749588

Original commit changeset: b6640af67587

fbshipit-source-id: 730ff95e19d2f222aa11d092fa53f661f3f0d367
2020-03-31 06:50:47 -07:00
59268d4cbf [JIT] Improve the error message when registering a custom class twice (#35568)
Summary:
I hit this exception when including the registration code with `torch::class_` in a header file, which was included in multiple cpp files and thus ran the registration twice. It could be helpful to improve the error message here to indicate what exactly happened.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35568

Differential Revision: D20759476

Pulled By: rohan-varma

fbshipit-source-id: 680f6a8abb4453cd7a311cda1e2a03f81e7f7442
2020-03-31 00:34:46 -07:00
800d5617c0 Recording of TorchScript functions (#34710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34710

Extending RecordFunction API to support new recording scopes (such as TorchScript functions), as well as giving more flexibility to set sampling rate.

Test Plan: unit test (test_misc.cpp/testRecordFunction)

Reviewed By: gdankel, dzhulgakov

Differential Revision: D20158523

fbshipit-source-id: a9e0819d21cc06f4952d92d43246587c36137582
2020-03-31 00:33:23 -07:00
8fef8d19fa clang-format (#35752)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35752

Test Plan: Imported from OSS

Differential Revision: D20761302

Pulled By: suo

fbshipit-source-id: 00624088b96945081889d5ef7be19c115e0328b4
2020-03-31 00:19:40 -07:00
a090de380c [quant][graph] Add quant fusion for dynamic quantization (#35586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35586

This pass fuses the choose_qparams-quant-dequant sequence.
Fusion for weight tensor is the same as static quant.

Test Plan:
python test/test_quantize_script.py

Imported from OSS

Differential Revision: D20755680

fbshipit-source-id: b7443770642b6e6fa0fa9da8a44637e9b2d4df70
2020-03-30 23:34:56 -07:00
1f7ee7b6b7 [quant][graph] Add pass to insert quant dequant for dynamic quantization (#35448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35448

Add _choose_qparams_per_tensor which returns scale and zero_point similar to the dynamic quantization in the operator

Test Plan:
python test/test_quantize_script.py

Imported from OSS

Differential Revision: D20755679

fbshipit-source-id: c9066d8f1bb3e331809be26c4be806faafc9b981
2020-03-30 23:33:32 -07:00
b0f8429826 Update clang_format.yml 2020-03-30 23:25:16 -07:00
35087b8d77 [Shape Inference] Try to infer input of elementwise ops (#35701)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35701

Reviewed By: yinghai

Differential Revision: D20745111

fbshipit-source-id: 5fcfe3796c1e80d4cf8843713089d802f3bf3759
2020-03-30 23:15:12 -07:00
4f4ed5c108 Disable c10::import(ns) (#35398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35398

This disables namespaced c10::import which is broken with custom
mobile op builds.  This is to help prevent people from accidentally
breaking the custom mobile build in a mysterious way; if they use
the longform version it will work.  Fixing the analyzer is tracked
in https://github.com/pytorch/pytorch/issues/35397

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20680519

Pulled By: ezyang

fbshipit-source-id: a18ac8df7e72bf399807870beedb828131273e48
2020-03-30 21:12:49 -07:00
726baf69d7 Do not link BLAS into torch_cuda/torch_hip (#35724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35724

When statically linking BLAS, this results in a second useless copy of
MKL in libtorch_cuda.so

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20758165

Pulled By: ezyang

fbshipit-source-id: 5a82a23c053f440b659f2ac2aaaf3c9d5ec69971
2020-03-30 21:04:14 -07:00
cd760fbd7f Updating submodules
Summary:
GitHub commits:

18cf0de640
df3425807f

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 52fb08a9119c9e2933e735cf3415f681070292cc
2020-03-30 20:50:55 -07:00
9018538ab3 [quant][graphmode][refactor] getGeneralTensorInputs(Node*) -> getPassThroughInputs(Value*) (#35558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35558

This is to have more fine-grained support for general ops.
E.g., for sort, the first output has pass-through inputs and the second output does not need to be quantized,
so we'll have a check for that.

Test Plan:
.

Imported from OSS

Differential Revision: D20752128

fbshipit-source-id: 825c4c393910a88ecb12e24e9a2f3b05c5d5a7ab
2020-03-30 19:30:37 -07:00
8add1843a9 [quant][graphmode][fix] docs for InsertObservers (#35557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35557

.

Test Plan:
.

Imported from OSS

Differential Revision: D20752129

fbshipit-source-id: 1cde20675c4f19fd59116332cae4a444e58973c0
2020-03-30 19:29:09 -07:00
c2ca4371ae [PyTorch BC] Clean up whitelist (#35730)
Summary:
Remove stale items
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35730

Reviewed By: hl475

Differential Revision: D20754058

Pulled By: houseroad

fbshipit-source-id: 5f38b34ad68bd7cb6db8cf1654424d655436ca35
2020-03-30 19:24:42 -07:00
2f3b952d16 Use std::abs instead of abs in lbfgs.cpp (#35698)
Summary:
`abs` is a C-style function that takes only an integral argument.
`std::abs` is polymorphic and can be applied to both integral and floating-point types.
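
A minimal illustration of the difference:
```cpp
#include <cmath>
#include <iostream>

int main() {
  double step = -0.7;
  // std::abs has floating-point overloads, so the magnitude survives:
  std::cout << std::abs(step) << "\n";  // 0.7
  // C's ::abs(int) would truncate the argument to 0 first, which is the
  // silent bug this commit removes from lbfgs.cpp.
  return 0;
}
```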
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35698

Test Plan: CI

Differential Revision: D20749588

Pulled By: malfet

fbshipit-source-id: b6640af67587650786366fe3907384bc8803069f
2020-03-30 18:47:28 -07:00
fdadaf62b0 Disable batch_norm_relu batch_norm3d quanitized ops tests (#35727)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35727

Differential Revision: D20754874

Pulled By: malfet

fbshipit-source-id: 287aeb62af28e0a17dccfa84e9c5bbb6913853ca
2020-03-30 18:08:56 -07:00
a15a4a5caf Revert D20722426: [pytorch][PR] [doc] Add overflow notice for cuFFT on half precision
Test Plan: revert-hammer

Differential Revision:
D20722426

Original commit changeset: 68f7304de5d6

fbshipit-source-id: 462133d8e8abff2e815a4a9b1eb047e7ecaa041a
2020-03-30 17:52:03 -07:00
56fabface2 fp16 include not needed (#35708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35708

These are not actually needed, and they break the normal include guard that selects the correct Half implementation.

Test Plan: CI green

Reviewed By: malfet

Differential Revision: D20744681

fbshipit-source-id: 70e3667593c987434415ad8ac3b68828875fc3fd
2020-03-30 17:47:44 -07:00
95c1b16fc5 don't replace TensorImpl for inplace min/max dim (#35591)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35591

Test Plan: buck test mode/dev //caffe2/test:cuda -- 'test_dim_reduction_cpu \(test_torch\.TestTorchDeviceTypeCPU\)'

Differential Revision: D20718321

Pulled By: ngimel

fbshipit-source-id: eba09b37ab1f22463114da69d35982e881dcaa85
2020-03-30 17:15:04 -07:00
46330b368a [5] register aten ops in lite interpreter for detectron2go model (#35248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35248

register aten ops in lite interpreter for detectron2go models. Also set catchAllKernel for some ops since the model requires a different DispatchKey.

(Note: this ignores all push blocking failures!)

Test Plan:
(whole stack)
buck build -c user.ndk_cxxflags='-g1' -c caffe2.expose_op_to_c10=1 //xplat/caffe2/fb/pytorch_predictor:maskrcnnAndroid#android-armv7

Reviewed By: iseeyuan

Differential Revision: D20528762

fbshipit-source-id: 4da4699fe547a63b0c664fe666a8a688f1ab8c6c
2020-03-30 17:07:41 -07:00
8981271d9f Skip test_mm on XLA (#35709)
Summary:
https://github.com/pytorch/pytorch/issues/34891 caused a 15 minute regression in XLA test timing when it inadvertently added this test to XLA -- I think it was intended to only add this test to CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35709

Test Plan: The XLA test job should return from ~75 to ~60 minutes.

Reviewed By: malfet

Differential Revision: D20748176

Pulled By: yns88

fbshipit-source-id: b50227a35bcbf2915b4f2013e2a4705e905d0118
2020-03-30 16:23:31 -07:00
e021c13d2d [doc] Add overflow notice for cuFFT on half precision (#35594)
Summary:
This would fix https://github.com/pytorch/pytorch/issues/33485.

cc ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35594

Differential Revision: D20722426

Pulled By: ngimel

fbshipit-source-id: 68f7304de5d6cecdd9e34e8697fc84bc551b1a45
2020-03-30 15:53:32 -07:00
5e27de021e [rpc] fix backward compatibility test (#35703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35703

Test Plan: Imported from OSS

Differential Revision: D20746388

Pulled By: wanchaol

fbshipit-source-id: 4a336fd4a05a606c5b209e5e62c35c70d35614d4
2020-03-30 15:17:51 -07:00
4e19e02976 [quant][graphmode] Quantization support for quantized::add_scalar_relu and quantized::add_scalar_relu_out (#35509)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35509

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20742138

fbshipit-source-id: f6216d0af5da2bd5629aa4909f05dcde7853c8b8
2020-03-30 14:44:38 -07:00
35715a56a9 [reland] Skip OpenMP Thread when OMP_NUM_THREADS is 1 (#35541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35541

When OMP_NUM_THREADS is set to 1, we don't need to launch the parallel_for body on an OpenMP thread, since there is no intra-op parallelism. By avoiding that, we reduce unnecessary context switches; a sketch of the idea is below.
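
A minimal sketch of the idea (not the actual at::parallel_for source; the helper name is illustrative):
```cpp
#include <cstdint>
#include <functional>
#include <ATen/Parallel.h>

// With a single configured thread there is no intra-op parallelism, so
// run the body inline instead of waking an OpenMP worker.
void run_range(int64_t begin, int64_t end,
               const std::function<void(int64_t, int64_t)>& body) {
  if (at::get_num_threads() == 1) {
    body(begin, end);  // execute on the calling thread, no context switch
    return;
  }
  // ... otherwise hand the range to the OpenMP-backed pool ...
}
```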

Test Plan: internal

Reviewed By: ilia-cher

Differential Revision: D20680465

fbshipit-source-id: 4476a810dfe7bf268fcd58fd00afb89ba61644cf
2020-03-30 14:39:57 -07:00
a0dc36e501 [Windows] Fix torch_cuda's forced link (#35659)
Summary:
The current config on `master` yields the following errors when building from source on Windows with CMake and Visual Studio 2019.
```
Severity	Code	Description	Project	File	Line	Suppression State
Error	LNK2001	unresolved external symbol \?warp_size@cuda@at@YAHXZ\	torch	D:\AI\pytorch\build_libtorch\caffe2\LINK	1
Severity	Code	Description	Project	File	Line	Suppression State
Error	LNK1120	1 unresolved externals	torch	D:\AI\pytorch\build_libtorch\bin\Release\torch.dll	1
Severity	Code	Description	Project	File	Line	Suppression State
Error	LNK2001	unresolved external symbol \?warp_size@cuda@at@YAHXZ\	caffe2_observers	D:\AI\pytorch\build_libtorch\modules\observers\LINK	1
Severity	Code	Description	Project	File	Line	Suppression State
Error	LNK1120	1 unresolved externals	caffe2_observers	D:\AI\pytorch\build_libtorch\bin\Release\caffe2_observers.dll	1
Severity	Code	Description	Project	File	Line	Suppression State
Error	LNK2001	unresolved external symbol \?warp_size@cuda@at@YAHXZ\	caffe2_detectron_ops_gpu	D:\AI\pytorch\build_libtorch\modules\detectron\LINK	1
Severity	Code	Description	Project	File	Line	Suppression State
Error	LNK1120	1 unresolved externals	caffe2_detectron_ops_gpu	D:\AI\pytorch\build_libtorch\bin\Release\caffe2_detectron_ops_gpu.dll	1
```

This change at least fixes the above errors in that specific setting. Do you think it makes sense to get this merged or will it break other settings?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35659

Differential Revision: D20735907

Pulled By: ezyang

fbshipit-source-id: eb8fa1e69aaaa5af2da3a76963ddc910bb716479
2020-03-30 13:59:31 -07:00
639c68b2fe bfloat16: enable basic math function (#35172)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35172

Test Plan: Imported from OSS

Differential Revision: D20721068

Pulled By: ngimel

fbshipit-source-id: 7e40bda6683f041f04f78739a950cb2a6ac74571
2020-03-30 13:58:11 -07:00
788ef939d8 float2::x and float2::y may not be the same as float on ROCm (#35593)
Summary:
This causes ambiguity and can be triggered sometimes (e.g., by https://github.com/pytorch/pytorch/issues/35217). Explicitly convert them to float.

    error: conditional expression is ambiguous; 'const
    hip_impl::Scalar_accessor<float, Native_vec_, 0>' can be converted to
    'float' and vice versa
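
A hedged sketch of the fix pattern described above:
```cpp
// On ROCm, float2's members are Scalar_accessor wrappers rather than
// plain floats, so convert both arms of the conditional explicitly
// before mixing them.
__device__ float pick(float2 v, bool take_x) {
  return take_x ? static_cast<float>(v.x) : static_cast<float>(v.y);
}
```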
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35593

Differential Revision: D20735663

Pulled By: ezyang

fbshipit-source-id: ae6a38a08e59821bae13eb0b9f9bdf21a008d5c0
2020-03-30 13:51:32 -07:00
dd98abb453 Enable splitSparseLengthsSumSparse in onnxifi (#35555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35555

Att. So that we can lower the SparseLengthsSum* part of SparseLengthsSum*Sparse. We update the tying policy between Gather and SparsLengthsWeightSum* so that we don't bother lowering a single Gather into the backend, which is inefficient to execute on card and creates bubbles between continuous lowering graphs.

Test Plan:
```
buck test glow/fb/test:test_onnxifinnpi
```

Reviewed By: ipiszy

Differential Revision: D20688525

fbshipit-source-id: cb8e38239057ff13a8d385ed09d0d019421de78b
2020-03-30 13:34:59 -07:00
e90e89d189 Transform pass to split SparseLengthsSumSparse (#35522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35522

We will need to apply this transform pass onto the net before lowering to Glow.

Test Plan:
```
buck test caffe2/caffe2/opt/custom:split_slss_test
```

Reviewed By: ipiszy

Differential Revision: D20688451

fbshipit-source-id: 22c0f5d0dcf97cc51cdc86bfc0abd90328ad5f2c
2020-03-30 13:34:54 -07:00
af4d86788c Split SparseLengthsSumSparse into SparseLengthsSumSparseLookup + SparseLengthsSum (#35507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35507

We want to split up the SparseLengthsSumSparse op into an indirection op and the SparseLengthsSum op so that we can lower the latter part.  The indirection part is a plain impl now.

Test Plan:
```
for i in `seq 10`; do buck test caffe2/caffe2/python/operator_test:lengths_reducer_fused_nbit_rowwise_ops_test -- test_sparse_lengths_sum_rowwise_sparse; done
```

Reviewed By: jspark1105

Differential Revision: D20683478

fbshipit-source-id: 509effe88719d20aa0c4783bbe0ce1f183ee473c
2020-03-30 13:33:29 -07:00
35dbc6ebda [BC] Fix the BC CI (#35692)
Summary:
seems due to skip logic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35692

Reviewed By: hl475

Differential Revision: D20740522

Pulled By: houseroad

fbshipit-source-id: 779c279f417a2a493ba7bbfd8b090b7792c6d2a8
2020-03-30 13:27:38 -07:00
39d0500434 Fix PyTorch separate compilation (Reland) (#35581)
Summary:
Looks like there is a bug in the CUDA device linker: kernels that use `thrust::sort_by_key` cannot be linked with other kernels.
Solve the problem by splitting 5 thrust-heavy .cu files into a `__torch_cuda_sp` library which is statically linked into `torch_cuda`.
For the default compilation workflow it should not make any difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35581

Test Plan: Compile with `-DCUDA_SEPARABLE_COMPILATION=YES` and observe library size difference: 310Mb before, 173Mb after if compiled for sm_75

Differential Revision: D20741379

Pulled By: malfet

fbshipit-source-id: e9083968324c113e44a39df0de356d79af8e7057
2020-03-30 13:21:57 -07:00
b1f08e7426 Call uncheckedSetDevice in ~InlineDeviceGuard only when device index are different (#35438)
Summary:
Setting the device can be expensive, especially when a debugger is present. We should check that the devices differ before we set; a sketch of the idea is below.
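
A hedged sketch (member names mirror c10::impl::InlineDeviceGuard but are illustrative):
```cpp
#include <c10/core/Device.h>
#include <c10/core/impl/DeviceGuardImplInterface.h>

// Only restore the device on destruction when the index actually
// changed, so unchanged guards pay nothing.
struct CheapDeviceGuard {
  const c10::impl::DeviceGuardImplInterface* impl_;
  c10::Device original_device_;
  c10::Device current_device_;

  ~CheapDeviceGuard() {
    if (original_device_.index() != current_device_.index()) {
      impl_->uncheckedSetDevice(original_device_);
    }
  }
};
```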

cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35438

Differential Revision: D20664084

Pulled By: ngimel

fbshipit-source-id: 2440b4c9d96c41b4a19d5b1e8e1756fa40f090f0
2020-03-30 13:13:17 -07:00
e7371957cf Report results from CPP unittests on Windows and Linux (Reland) (#35590)
Summary:
Add `--gtest_output=xml:/path/to/artifact-metadata-folder` to scripts invoking unit tests
Add artifacts metadata to windows test jobs
Install `unittest-xml-reporting` and add the IN_CIRCLECI environment variable to report python test results on Windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35590

Test Plan: Results should eventually be published to: https://circleci.com/build-insights/gh/pytorch/pytorch/master

Differential Revision: D20742687

Pulled By: malfet

fbshipit-source-id: baae60bdb0a4fb8d4f0d2baa77c65402fa2b99ae
2020-03-30 13:01:45 -07:00
f3151052ce [autograd] fix engine flakiness (#35599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35599

We didn't check whether the ready queue was empty before
https://github.com/pytorch/pytorch/pull/33157 because the CPU worker's
queue might not be empty, but after #33157 we check whether the owner
thread's ready_queue is empty after inline execution.

This might not always hold true; imagine the following case:

a CPU thread that calls backward() and a GPU device thread, where the graph is:

GraphRoot(CPU) -> ComputeNode(GPU)

In both thread_main calls, they decrement `--local_graph_task->outstanding_tasks_` to zero together, and then both threads enter `if (graph_task_completed(local_graph_task))`. The CPU thread breaks out, finishes, and checks whether local_ready_queue is empty; the GPU thread sends a dummy task to the CPU thread's ready queue, because it thinks the graph_task finished on its own thread (it actually finished on both threads together). So there are cases where a dummy task remains in the queue.

This happens rarely and non-deterministically, but it might get triggered when we run many jobs in CI. Remove the check to fix the flakiness.

Test Plan: Imported from OSS

Differential Revision: D20739778

Pulled By: wanchaol

fbshipit-source-id: 75a671762650a188f44720625d53f0873617c684
2020-03-30 12:39:39 -07:00
bb32e123e6 Report results of python unit tests during window test runs (#35687)
Summary:
Define `store_test_results` attribute in CircleCI yamls
Install `unittest-xml-reporting` and define `IN_CIRCLECI` environment variable to trigger test runners to save results to XML
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35687

Differential Revision: D20739831

Pulled By: malfet

fbshipit-source-id: 6a7bbf19f93c32766963f5edad191ad8ca316ff8
2020-03-30 12:33:03 -07:00
3f3b96b1f8 Revert D20735881: [pytorch][PR] [WIP] [reland][pytorch][PR] Fix some incorrect annotation…
Test Plan: revert-hammer

Differential Revision:
D20735881

Original commit changeset: d21e940380f0

fbshipit-source-id: fb50a099320bfac92c9b8e1ca12cdc50d302342f
2020-03-30 12:28:27 -07:00
e7a37823b0 [WIP] [reland][pytorch][PR] Fix some incorrect annotation… (#35588)
Summary:
…s found by clang-cl"

This reverts commit a9b540d109aa72e6ba8748019ef1c3ba0d8fac2b.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35588

Differential Revision: D20735881

Pulled By: ezyang

fbshipit-source-id: d21e940380f0c1b9b9b84e9cc892985fd3ad0ac3
2020-03-30 11:42:19 -07:00
3bdc4a37ed CMake script cleanup - mixed case for function names (#35589)
Summary:
Running the following code.
```bash
cmake --help-command-list |
grep -v "cmake version" |
while read c; do
    echo 's/\b'"$(echo $c | tr '[:lower:]' '[:upper:]')"'\(\s*\)(/'"$c"'\1(/g'
done >convert.sed &&
git ls-files -z -- bootstrap '*.cmake' '*.cmake.in' '*CMakeLists.txt' |
egrep -z -v '^(cmake/Modules/|cmake/Modules_CUDA_fix/)' |
xargs -0 sed -i -f convert.sed &&
rm convert.sed
```
cmake-lint is too sensitive about mixed case so I didn't switch the check on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35589

Differential Revision: D20735648

Pulled By: ezyang

fbshipit-source-id: a09a60a7ce921bb198575a35335faa299bd10b66
2020-03-30 11:37:02 -07:00
bf2b411730 Save results of cpp unittest to test/test-reports folder (#35686)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35686

Test Plan: CI

Differential Revision: D20739679

Pulled By: malfet

fbshipit-source-id: 813e192a6dec1193a21a69cefd45198c8f1361d1
2020-03-30 11:32:12 -07:00
0eb26fb01e [ROCm] Properly blacklist (#35230)
Summary:
test_python_all_except_nn
+ /usr/bin/python3.6 test/run_test.py --exclude test_nn test_jit_simple
test_jit_legacy test_jit_fuser_legacy --verbose --bring-to-front
test_quantization test_quantized test_quantized_tensor
test_quantized_nn_mods --determine-from=

test_nn continues to be run as part of test1 target

This allows us to run run_test.py while correctly disabling these test sets for ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35230

Differential Revision: D20735851

Pulled By: ezyang

fbshipit-source-id: 255d21374c9605c8f8b6ffa1b08f58fb10d8e543
2020-03-30 08:57:03 -07:00
e3daf70184 Fix AVX detection with clang-cl (#35653)
Summary:
Defining macros like `/D__F16C__` won't work on clang-cl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35653

Differential Revision: D20735878

Pulled By: ezyang

fbshipit-source-id: 392a664b0a9e74222b1a03b8c3f6ebb2c61d867e
2020-03-30 07:53:37 -07:00
340048b67c [quant][graphmode] Remove unused patterns (#35385)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35385

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20655298

fbshipit-source-id: bc5eda2640a809adb55d3d645c65fb02a6f2f444
2020-03-29 23:48:15 -07:00
728c7dcea3 ONNX Update training ops and training amenable export API (#35567)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35567

Reviewed By: hl475

Differential Revision: D20715339

Pulled By: houseroad

fbshipit-source-id: ad88097e76b169035ab5814b769dc1bed54c6008
2020-03-29 23:14:25 -07:00
1f759936f0 Propagate model id used by Predictor to Caffe2 logging
Summary:
Does the same things as D19658565 but for Caffe2 models.

From investigation https://fb.quip.com/PbgsAEmoJVuf the model id that predictor uses and the model id saved inside the model don't match. Common reason is recurring fluent2 jobs but there are others.

Since model_id from predictor is what the rest of datasets use, it's way more useful imho. I've considered adding both ids, but it'd require additional piping and I don't think it's that useful.

Test Plan: unittests added

Reviewed By: houseroad

Differential Revision: D20630599

fbshipit-source-id: 3e6d0cb0b6f8c8b6ae5935138f55ae7a2ff60653
2020-03-29 23:07:32 -07:00
2c19b53d4f [iOS] Enable selective build for testing FBNet in PyTorchPlayground (#35647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35647

Since we have enabled the unit test for FBNet on iOS, it'll block people from landing due to missing selective-build support. This PR adds the missing ops in PyTorchPlayground to support FBNet.
ghstack-source-id: 101098537

allow-large-files

Test Plan: - `buck test PyTorchPlayground`

Reviewed By: iseeyuan

Differential Revision: D20723020

fbshipit-source-id: dc4443f50bb39166dbf45ca159bb32d5b45d2eea
2020-03-29 20:42:51 -07:00
9e3605de98 [RELAND] New operator registration API (#35061) (#35629)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/35061 ; removed
the get qualified type name magic from debug strings to work around
MSVC 2017 bug.

Main points of the new API:

- You can register implementations (impl) without having to specify a schema.
- Registrations are commutative, so no matter what order your static
  initializers run, you end up with the same end result.

op_registration_test.cpp contains a reasonably comprehensive accounting
for the available API surface

How does this implementation proceed?  The basic concept is to relax the
internal invariants of Dispatcher data structures to allow the
possibility that a FunctionSchema is not specified in an Operator.

- DispatchKeyExtractor has an uninitialized state where it doesn't look
  for dispatch keys in any arguments of the stack.  It can have a
  schema (de)registered to itself post facto with
  registerSchema/unregisterSchema.
- DispatchTable has a new constructor taking only an OperatorName for
  the uninitialized state.  It can have a schema (de)registered to itself
  post facto with registerSchema/unregisterSchema
- OperatorDef maintains counts of defs as well as defs_and_impls.
  defs_and_impls keeps track of the outstanding impl registrations; you
  may have impl registrations but no defs.  If there are no defs (no
  schema), the operator is not returned by findSchema.  A new
  findOperatorByName function unconditionally returns the OperatorHandle
  even if there's no schema.  OperatorHandle::hasSchema can be used
  to check if the operator has schema.
- Replaced 'registerKernel' with 'registerImpl', which is the new
  interface for directly registering kernels without having to specify a schema.
- Because 'registerImpl' no longer requires an OperatorHandle, change
  'registerDef' to only return a RegistrationHandleRAII.  This is marginally
  less efficient (since we're doing two hash table lookups on a registration
  now), but this won't matter in the long term, and probably doesn't
  matter now either.
- Rename registerBackendFallbackKernel to registerFallback (this exposed
  a bunch of places where we're improperly directly interfacing with Dispatcher;
  we need to add this capability to the true public API)
- All code generated internal registrations are switched to use the new
  API.  This includes VariableType registrations (which previously
  weren't converted) and the mobile autograd stuff
- Switch the new-style def()/impl() APIs to interact directly with Dispatcher,
  rather than indirecting through the old API
- We deleted alias analysis kind merging entirely.  As a nod to BC, it's
  possible to define a full schema with alias analysis kind, and then
  later do another full schema def with missing alias analysis kind, but
  the opposite direction is not allowed.  We can remove this entirely
  following the plan at https://github.com/pytorch/pytorch/issues/35040
- Schema matching is moved inside the dispatcher, because we might not
  be able to immediately schema match at the point of an impl() (because
  we don't have the schema yet).  To do this, we store the inferred
  function schema inside a KernelEntry, so we can check it when we get
  the real schema.
- Registered kernel functions now store a debug string which
  can be used to more easily identify them.  Tests use this to
  distinguish between multiple distinct registrations; regular
  invocations get only very basic information.

Because we need our static initializers to work no matter what order
they're run, the testing strategy on this PR is quite involved.

The general concept:
- Bind a (very gimped) version of the dispatcher API from Python,
  so that we can easily write a more complex testing harness
  using expect tests.
- For series of registrations we want to test, exhaustively
  test every possible permutation of registrations (and
  deregistrations), and show that the intermediate states
  agree no matter what path is taken.
- Intermediate states are rendered using a new dumpState()
  debugging method that prints the internal state of the
  dispatcher.  This method may be generally useful for people
  who want to see what's in the dispatcher.
- Simultaneously, add a new invariant testing function which
  checks that the internal invariants of the dispatcher are
  upheld (so we don't have to print internal implementation
  details of the dispatcher)
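
To make the permutation strategy concrete, here is a minimal, hypothetical Python sketch (helper names `run_permutation`/`check_commutative` are illustrative; the real harness is test/test_dispatch.py, which additionally exercises deregistrations and the invariant checker):

```python
from itertools import permutations

def run_permutation(ctors, order):
    """Apply registrations in the given order; return a dumpState()-style snapshot."""
    state = {}                      # stands in for internal dispatcher state
    for i in order:
        ctors[i](state)
    return sorted(state.items())

def check_commutative(ctors):
    """Every permutation of registrations must yield the same end state."""
    expected = None
    for order in permutations(range(len(ctors))):
        got = run_permutation(ctors, order)
        assert expected is None or got == expected, order
        expected = got

# Toy registrations: a def() and an impl() that may arrive in any order.
check_commutative([
    lambda s: s.update(schema="test::foo(Tensor x) -> Tensor"),
    lambda s: s.update(cpu_impl="fn_cpu"),
])
```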

The testing framework found a few bugs in development.  For example,
here is a case where we registered schema too early, before checking
if it was valid:

```
Traceback (most recent call last):
  File "test/test_dispatch.py", line 164, in test_def_impl_schema_mismatch
    ], raises=True)
  File "test/test_dispatch.py", line 135, in commute
    results=results, raises=raises)
  File "test/test_dispatch.py", line 83, in run_permutation
    .format(ctor_order[:i], op_ix))
  File "test/test_dispatch.py", line 59, in check_invariants
    .format(expected_provenance, actual_provenance)
AssertionError: 'name[16 chars]ema: (none)\ncatchall: boxed unboxed :: (Tenso[18 chars]0)\n' != 'name[16 chars]ema: test::foo(Tensor x, Tensor y) -> (Tensor)[53 chars]0)\n'
  name: test::foo
- schema: (none)
+ schema: test::foo(Tensor x, Tensor y) -> (Tensor)
  catchall: boxed unboxed :: (Tensor _0) -> (Tensor _0)
 : expected from running ctors (1,); actual from running ctors (1,) and then failing to run ctor 0 (did this failure leave the dispatcher in a wedged state? it shouldn't!)
```

There are also C++ smoketests for the API.  These tests comprehensively
cover the C++ API surface of the new operator registration API, but
don't check very hard if the API does the right thing (that's what
test_dispatch.py is for)

Some miscellaneous changes which could have been split into other
PRs, but I was too lazy to do so:

- Add torch::jit::parseName (mirroring parseSchema/parseSchemaOrName)
- Add cloneWithName functionality to FunctionSchema
- Unconditionally generate schema registration, even when type_method_dispatch
  is a dict.  The one exception is for manual registrations....
- Add fallback, CppFunction::makeFallthrough and
  CppFunction::makeFromBoxedFunction to public API of op_registration, so we can
  stop calling internal registerImpl directly
- Add new syntax sugar dispatch_autograd for registering autograd kernels
- Minor OperatorName cleanup, storing OperatorName in DispatchTable
  and defining operator<< on OperatorName
- Refactored the op registration API to take FunctionSchema directly.
  We now do namespacing by post facto fixing up the OperatorName
  embedded in FunctionSchema.  This also means that you can
  now do torch::import("ns1").def("ns2::blah") and have the ns2
  override ns1 (although maybe this is not the correct behavior.)
- New torch::schema public API, for attaching alias analysis kind
  annotation kinds.  This meant we had to template up some function
  signatures which previously took const char*.  There's now a nice
  comment explaining this strategy.
- torch::import now takes std::string which means we can use
  the namespacing from Python

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35629

Differential Revision: D20724551

Pulled By: ezyang

fbshipit-source-id: befa46a1affb4ec4ae1fb39e3564a63695a6ca41
2020-03-29 19:48:29 -07:00
86be6443d8 [quant][graphmode] Quantization support for aten::conv3d (#35347)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35347

Test Plan:
python test/test_jit.py TestJit.test_quantized_conv3d

Imported from OSS

Differential Revision: D20655304

fbshipit-source-id: 2ab6a977eda9064fbb8051669738f37b90f13b6f
2020-03-29 17:39:06 -07:00
e397f87c4b [aten] remove variable set but never used warning (#34015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34015

Remove warning
```
caffe2/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu(1400): warning: variable "info" was set but never used
```

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20181160

fbshipit-source-id: 31d44522a558fe7c2661a84dd6c35eb9d05b757a
2020-03-29 15:29:22 -07:00
77b4e2d2fc [quant][graphmode][fix] Add filter for quantized::add (#35345)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35345

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20655297

fbshipit-source-id: b93791117aaea228ee36c99761b3a46ccd3ea6d1
2020-03-29 14:02:10 -07:00
915a45298c [aten] remove warning on change of sign (#34016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34016

Remove warning
```
caffe2/aten/src/ATen/native/cuda/Reduce.cuh(654): warning: integer conversion resulted in a change of sign
```

When `acc_ptr_ != nullptr`, numerator_ and denominator_ must have been initialized.

Other minor changes:
* Make member variables of AccumulationBuffer private
* size_factor_ is used nowhere.

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D20181169

fbshipit-source-id: e4d023f7fa0692e62be21cfbd971cad8dfb69ea4
2020-03-29 13:29:23 -07:00
dbd2b8bb41 [SigridHashOp] Fix converter (#34836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34836

Once the SigridHashOp argument is supplied, I realized the shape inference is still wrong because the argument is not supplied in the debug_ssa. Thanks to yinghai for catching that I didn't fix the converter; fixing it in this diff.

Test Plan:
Run the binary, and checked the exported op

  op {
		input: "sequential_250/parallel/normalization/dper_feature_normalization/sparse_features_processor/sparse_feature_transform/gather_ranges_GSF_IDLIST_COOCCUR_APP_ID_NEKO_ORGANIC_1D_7D_INSTALL_V1/gathered_values_0"
		output: "sequential_250/parallel/normalization/dper_feature_normalization/sparse_features_processor/sparse_feature_transform/sequential_1/hash_feature_ids/SigridHash:0_0"
		type: "SigridHash"
		arg {
			name: "salt"
			i: 0
		}
		arg {
			name: "maxValue"
			i: 100000
		}
		arg {
			name: "hashIntoInt32"
			i: 1
		}
		arg {
			name: "net_pos"
			i: 3
		}
	}

It now has hashIntoInt32.

Reviewed By: yinghai

Differential Revision: D20457057

fbshipit-source-id: 023ade5e66df82037a8f2da3174383dda8aff230
2020-03-29 13:06:05 -07:00
6fc2403951 [quant][graphmode] qconfig_dict support None (#35336)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35336

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D20655302

fbshipit-source-id: b453f3240ac487aa29629953b4d71274dbbc25fc
2020-03-29 12:47:47 -07:00
64058796e0 clang-format (#35635)
Summary:
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35635

Differential Revision: D20722979

Pulled By: malfet

fbshipit-source-id: fb04905118e2590cbf109170552ac45e510b67a8
2020-03-29 12:22:43 -07:00
860790de88 Makes torch.real and torch.imag NumPy compatible, but disables them for complex tensors (#35560)
Summary:
The current implementations of torch.real and torch.imag are not NumPy compatible. In particular:

- torch.real on a real tensor does not return the real tensor, like contiguous
- torch.real on a complex tensor does not return a real-valued view of the real part
- torch.imag on a complex tensor does not return a real-valued view of the imaginary part
- torch.Tensor.real and torch.Tensor.imag exist as methods, but in NumPy they are writable attributes

This PR makes the functions NumPy compatible by removing the method variants and out kwarg, restricting them to work on only real tensors, and updating the behavior of torch.real to return its input. New tests are added to test_torch.py to verify the behavior, a couple existing complex tests are skipped, and the documentation is updated to reflect the change.
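
A small illustration of the post-change behavior described above (a hedged sketch, not code from the PR):

```python
import torch

t = torch.arange(3.)
# Per the summary above, torch.real on a real tensor now returns its input,
# analogous to calling contiguous() on an already-contiguous tensor.
assert torch.real(t) is t
# On complex tensors, real/imag are disabled until real-valued views of
# complex tensors are supported.
```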
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35560

Differential Revision: D20714568

Pulled By: mruberry

fbshipit-source-id: 5dd092f45757b620c8426c829dd15ee997246a26
2020-03-29 02:09:00 -07:00
67c3822944 [quant][graphmode] Make aten::relu a general op (#35420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35420

This PR makes `aten::relu` a general op that doesn't require observation.

This means we also need to change the logic for skipping intermediate values, because
it breaks the `conv - relu` pattern when the pattern is not followed by something quantizable:
`conv` is quantizable, but we decide to skip observing between `conv` and `relu`.

We changed the old `skip_values` to a new `delay_observation_map_`, which records information that
allows us to delay the observation of certain values until later points. In the case of the `conv - relu`
pattern, we delay observing the output of `conv` and observe the output of `relu` instead.
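
For reference, the `conv - relu` pattern in question looks like the following (an illustrative module, not taken from this PR):

```python
import torch

class ConvReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 1)

    def forward(self, x):
        # With delayed observation, the intermediate output of conv is not
        # observed; only the output of relu is.
        return torch.relu(self.conv(x))
```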

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20655309

fbshipit-source-id: 37dbe8a5e2f4cd7582ed67c405f9cf437dd00dbe
2020-03-28 21:29:07 -07:00
efec027653 [quant][graphmode] prepare_script takes original qconfig_dict (#35335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35335

We'll script the qconfig_dict in `prepare_script`

Test Plan:
regression tests in `python test/test_jit.py`

Imported from OSS

Differential Revision: D20655311

fbshipit-source-id: 002bfd905ff9a9b298a8073d42e12cfffcd1eb71
2020-03-28 18:36:46 -07:00
227beb9095 Revert D20680520: New operator registration API
Test Plan: revert-hammer

Differential Revision:
D20680520

Original commit changeset: 5d39a28e4ec7

fbshipit-source-id: 5b2497ffc24db9a05b01d526f161bc0164f9f707
2020-03-28 14:49:56 -07:00
486277a309 Replace four make_offset_calculator functions with one (#35551)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35551

Test Plan: Imported from OSS

Differential Revision: D20703649

Pulled By: pbelevich

fbshipit-source-id: 7ab13107dc630de63b3dee776697a96439ffb033
2020-03-28 14:42:26 -07:00
444332710c [quant][graphmode] Quantization support for quantized::add_scalar (#35334)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35334

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20655299

fbshipit-source-id: 66e1fa215a4a40f40dc7abe442c05bb5b6b20cfe
2020-03-28 14:00:44 -07:00
76d5102587 add a cuda/fuser job for legacy graph executor (#35419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35419

Differential Revision: D20719013

Pulled By: Krovatkin

fbshipit-source-id: 745d9523a5a9b7b4b556a075351ea58a82501dff
2020-03-28 12:11:18 -07:00
cd00bbc23f clang-format. (#35605)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35605

Test Plan: Imported from OSS

Reviewed By: orionr

Differential Revision: D20720486

Pulled By: ZolotukhinM

fbshipit-source-id: f081a9fb6ef84fdce3b8f071d5e251e267854a18
2020-03-28 11:45:06 -07:00
28ab8c6ff8 New operator registration API (#35061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35061

Main points of the new API:

- You can register implementations (impl) without having to specify a schema.
- Registrations are commutative, so no matter what order your static
  initializers run, you end up with the same end result.

op_registration_test.cpp contains a reasonably comprehensive accounting
for the available API surface

How does this implementation proceed?  The basic concept is to relax the
internal invariants of Dispatcher data structures to allow the
possibility that a FunctionSchema is not specified in an Operator.

- DispatchKeyExtractor has an uninitialized state where it doesn't look
  for dispatch keys in any arguments of the stack.  It can have a
  schema (de)registered to itself post facto with
  registerSchema/unregisterSchema.
- DispatchTable has a new constructor taking only an OperatorName for
  the uninitialized state.  It can have a schema (de)registered to itself
  post facto with registerSchema/unregisterSchema
- OperatorDef maintains counts of defs as well as defs_and_impls.
  defs_and_impls keeps track of the outstanding impl registrations; you
  may have impl registrations but no defs.  If there are no defs (no
  schema), the operator is not returned by findSchema.  A new
  findOperatorByName function unconditionally returns the OperatorHandle
  even if there's no schema.  OperatorHandle::hasSchema can be used
  to check if the operator has schema.
- Replaced 'registerKernel' with 'registerImpl', which is the new
  interface for directly registering kernels without having to specify a schema.
- Because 'registerImpl' no longer requires an OperatorHandle, change
  'registerDef' to only return a RegistrationHandleRAII.  This is marginally
  less efficient (since we're doing two hash table lookups on a registration
  now), but this won't matter in the long term, and probably doesn't
  matter now either.
- Rename registerBackendFallbackKernel to registerFallback (this exposed
  a bunch of places where we're improperly directly interfacing with Dispatcher;
  we need to add this capability to the true public API)
- All code generated internal registrations are switched to use the new
  API.  This includes VariableType registrations (which previously
  weren't converted) and the mobile autograd stuff
- Switch the new-style def()/impl() APIs to interact directly with Dispatcher,
  rather than indirecting through the old API
- We deleted alias analysis kind merging entirely.  As a nod to BC, it's
  possible to define a full schema with alias analysis kind, and then
  later do another full schema def with missing alias analysis kind, but
  the opposite direction is not allowed.  We can remove this entirely
  following the plan at https://github.com/pytorch/pytorch/issues/35040
- Schema matching is moved inside the dispatcher, because we might not
  be able to immediately schema match at the point of an impl() (because
  we don't have the schema yet).  To do this, we store the inferred
  function schema inside a KernelEntry, so we can check it when we get
  the real schema.
- Registered kernel functions now store a debug string which
  can be used to more easily identify them.  There's some best
  effort stuff based on __FUNCSIG__ but this is only really
  capable of reporting types and not function symbols.  Tests
  use this to distinguish between multiple distinct registrations.

Because we need our static initializers to work no matter what order
they're run, the testing strategy on this PR is quite involved.

The general concept:
- Bind a (very gimped) version of the dispatcher API from Python,
  so that we can easily write a more complex testing harness
  using expect tests.
- For series of registrations we want to test, exhaustively
  test every possible permutation of registrations (and
  deregistrations), and show that the intermediate states
  agree no matter what path is taken.
- Intermediate states are rendered using a new dumpState()
  debugging method that prints the internal state of the
  dispatcher.  This method may be generally useful for people
  who want to see what's in the dispatcher.
- Simultaneously, add a new invariant testing function which
  checks that the internal invariants of the dispatcher are
  upheld (so we don't have to print internal implementation
  details of the dispatcher)

The testing framework found a few bugs in development.  For example,
here is a case where we registered schema too early, before checking
if it was valid:

```
Traceback (most recent call last):
  File "test/test_dispatch.py", line 164, in test_def_impl_schema_mismatch
    ], raises=True)
  File "test/test_dispatch.py", line 135, in commute
    results=results, raises=raises)
  File "test/test_dispatch.py", line 83, in run_permutation
    .format(ctor_order[:i], op_ix))
  File "test/test_dispatch.py", line 59, in check_invariants
    .format(expected_provenance, actual_provenance)
AssertionError: 'name[16 chars]ema: (none)\ncatchall: boxed unboxed :: (Tenso[18 chars]0)\n' != 'name[16 chars]ema: test::foo(Tensor x, Tensor y) -> (Tensor)[53 chars]0)\n'
  name: test::foo
- schema: (none)
+ schema: test::foo(Tensor x, Tensor y) -> (Tensor)
  catchall: boxed unboxed :: (Tensor _0) -> (Tensor _0)
 : expected from running ctors (1,); actual from running ctors (1,) and then failing to run ctor 0 (did this failure leave the dispatcher in a wedged state? it shouldn't!)
```

There are also C++ smoketests for the API.  These tests comprehensively
cover the C++ API surface of the new operator registration API, but
don't check very hard if the API does the right thing (that's what
test_dispatch.py is for)

Some miscellaneous changes which could have been split into other
PRs, but I was too lazy to do so:

- Add torch::jit::parseName (mirroring parseSchema/parseSchemaOrName)
- Add cloneWithName functionality to FunctionSchema
- Unconditionally generate schema registration, even when type_method_dispatch
  is a dict.  The one exception is for manual registrations....
- Add fallback, CppFunction::makeFallthrough and
  CppFunction::makeFromBoxedFunction to public API of op_registration, so we can
  stop calling internal registerImpl directly
- Add new syntax sugar dispatch_autograd for registering autograd kernels
- Minor OperatorName cleanup, storing OperatorName in DispatchTable
  and defining operator<< on OperatorName
- Refactored the op registration API to take FunctionSchema directly.
  We now do namespacing by post facto fixing up the OperatorName
  embedded in FunctionSchema.  This also means that you can
  now do torch::import("ns1").def("ns2::blah") and have the ns2
  override ns1 (although maybe this is not the correct behavior.)
- New torch::schema public API, for attaching alias analysis kind
  annotation kinds.  This meant we had to template up some function
  signatures which previously took const char*.  There's now a nice
  comment explaining this strategy.
- torch::import now takes std::string which means we can use
  the namespacing from Python

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20680520

Pulled By: ezyang

fbshipit-source-id: 5d39a28e4ec7c73fe4b1fb2222e865ab65e188f5
2020-03-28 10:52:49 -07:00
e90c32f11f [quant][graphmode][refactor] Support filter function in quant fusion patterns (#35333)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35333

Test Plan:
regression tests in: python test/test_jit.py

Imported from OSS

Differential Revision: D20655312

fbshipit-source-id: 50b937bc56aff93f20fe9a0079bf3aec50f6d25d
2020-03-28 08:23:44 -07:00
5557ceb84e Remove pytorch_linux_xenial_py3_5 build and test jobs (#35587)
Summary:
Because Python-3.5 is no longer supported on master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35587

Differential Revision: D20718522

Pulled By: orionr

fbshipit-source-id: e8fdef044d8861fe062318c686215286fec3808b
2020-03-28 06:35:54 -07:00
5b3492df18 [TensorExpr] Extend arithmetic simplifier to work with multi variable expressions (Attempt 2) (#35415)
Summary:
https://github.com/pytorch/pytorch/pull/35127 was landed and reverted because I missed a test failure (oops). I have found and fixed the issue, which was due to zero terms being introduced after the point that filtered them out (usually requiring NAN/INF, e.g. x / INF => 0).

See https://github.com/pytorch/pytorch/pull/35127 for more info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35415

Reviewed By: ZolotukhinM

Differential Revision: D20702957

Pulled By: nickgg

fbshipit-source-id: 119eb41e9fa676bd78e3d1df99297a47ae312185
2020-03-28 00:19:55 -07:00
683246e5ea Improves precision of linspace, logspace (#35461)
Summary:
The Torch algorithms for linspace and logspace conceptually compute each of their values using:

`start_value + step_value * idx`

And NumPy does the same (see numpy/core/function_base.py, L24), except NumPy then sets the last value in its array directly (function_base.py, L162). This is because the above computation is unstable when using floats, and NumPy's contract, like PyTorch's, is that the last element in the array is the stop value.

In PyTorch there can be a divergence between the computed last value and the actual value. One user reported case was:

`torch.linspace(-0.031608279794, 0.031531572342, 257, dtype=torch.float32)`

Which causes a difference of 3.7253e-09 between the last value as set by NumPy and computed by PyTorch. After this PR the difference is zero.

Instead of simply setting the last element of the tensor, this PR updates the kernels with a "symmetric" algorithm that sets the first and last array elements without requiring an additional kernel launch on CUDA. The performance impact of this change seems small. I tested with step sizes of 2^8 and 2^22, and all timing differences were imperceptible except for 2^22 on CPU, which appears to have suffered a ~5% slowdown. I think that's an acceptable performance hit for the improved precision when we consider the context of linspace.
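
A minimal sketch of the symmetric idea, written in NumPy purely for illustration (the actual kernels are C++/CUDA; `symmetric_linspace` is a hypothetical name):

```python
import numpy as np

def symmetric_linspace(start, stop, steps):
    # Assumes steps > 1. Fill from both ends so the first and last
    # elements are exact, meeting in the middle.
    step = (stop - start) / (steps - 1)
    out = np.empty(steps, dtype=np.float32)
    for i in range((steps + 1) // 2):
        out[i] = start + i * step             # fill forward from start
        out[steps - 1 - i] = stop - i * step  # fill backward from stop
    return out

vals = symmetric_linspace(-0.031608279794, 0.031531572342, 257)
assert vals[-1] == np.float32(0.031531572342)  # last element is exact
```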

An alternative would be to simply set the last element, as NumPy does, on CPU. But I think it's preferable to keep the CPU and CUDA algorithms aligned and keep the algorithm symmetric. In current PyTorch, for example, torch.linspace starts generating values very similar to NumPy, but as the index increases so do the errors, giving our current implementation a "left bias."

Two tests are added to test_torch.py for this behavior. The linspace test will fail on current PyTorch, but the logspace test will succeed since its more complex computation needs wider error bars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35461

Differential Revision: D20712539

Pulled By: mruberry

fbshipit-source-id: 2c1257c8706f4cdf080ff0331bbf2f7041ab9adf
2020-03-27 23:50:39 -07:00
f1d69cb2f8 [quant][graphmode] Quantization support for permute and repeat_interleave (#35332)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35332

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20655306

fbshipit-source-id: 43dce62ce178d5c7e68b27fd88ed5d2958014c7b
2020-03-27 22:40:25 -07:00
df27b32014 [quant][graphmode] Make interpolate/upsample work again (#35130)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35130

Test Plan:
python test/test_jit.py TestJit.test_swap_dequantize_all_ops

Imported from OSS

Differential Revision: D20655303

fbshipit-source-id: 5ad8c6de28bcabffdfab4c9bc6a61f19f1d061cc
2020-03-27 22:38:57 -07:00
21c94606b8 Cleans up type conversions, adds CPU test comparing with NumPy (#35374)
Summary:
Per title. Follow-up to https://github.com/pytorch/pytorch/pull/35086.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35374

Differential Revision: D20712443

Pulled By: mruberry

fbshipit-source-id: 987089c14bff644fd6a636da5530dc260e1d1a68
2020-03-27 22:11:57 -07:00
1cc4e5c338 [quant][graphmode] SwapDeQuant support prim::If (#35142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35142

Support swapping dequant for prim::If nodes. This includes detecting that
all blocks of a prim::If end with dequantize, deleting these dequantize nodes,
and inserting a new dequantize for the output of the prim::If.

Test Plan:
see next PR that enables swap dequant for interpolate: https://github.com/pytorch/pytorch/pull/35130

Imported from OSS

Differential Revision: D20655307

fbshipit-source-id: 4fd53fbde8e169b7d98251e72ca37a29acdeb295
2020-03-27 21:41:18 -07:00
c672a7340b [quant][graphmode][refactor] getGeneralOpTensorInputIndexes -> getGeneralOpTensorInputs (#35141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35141

This is preparing for the support of prim::If in SwapDeQuant

Test Plan:
.

Imported from OSS

Differential Revision: D20655300

fbshipit-source-id: 0c66cab37f3f46dd34217a7b99a4d25a159c8487
2020-03-27 19:28:13 -07:00
26b2725167 [quant][graphmode][refactor] swapDeQuant takes block as argument (#35135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35135

This is in preparation for the support of prim::If in SwapDeQuant

Test Plan:
.

Imported from OSS

Differential Revision: D20655296

fbshipit-source-id: d8507e0020096940e14bc0fb7bde6a22ce706b72
2020-03-27 19:26:12 -07:00
2ef5b947a8 Disable unit test failing on Windows (#35549)
Summary:
Introduce a DISABLED_ON_WINDOWS macro that adds the `DISABLED_` prefix to a string when compiled for Win32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35549

Test Plan: CI

Differential Revision: D20700915

Pulled By: malfet

fbshipit-source-id: adddfe2db89b7139093ceef6899862bce0adcf2d
2020-03-27 19:20:29 -07:00
ad1091f753 Fixes default dtype value for onnx hardtanh export (opset11) (#35467)
Summary:
One-line fix to lara-hdr's PR https://github.com/pytorch/pytorch/pull/30169.

Default `dtype` value should be set when `dtype is None` rather than when `dtype is not None`.
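
The shape of the one-line fix, as a standalone hypothetical illustration (`resolve_dtype` is not the actual symbolic function):

```python
def resolve_dtype(dtype=None, default="float"):
    # Before the fix the guard was inverted (`if dtype is not None`),
    # so the default was only applied when a dtype was already given.
    if dtype is None:
        dtype = default
    return dtype

assert resolve_dtype() == "float"
assert resolve_dtype("int64") == "int64"
```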

I didn't make an issue for this as such a small change but I have been using this locally in order to export a model with opset 11 (opset 10 still works).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35467

Differential Revision: D20686048

Pulled By: mruberry

fbshipit-source-id: 726a5f9c0711c7a79b171fe98b602cdef27f9b31
2020-03-27 19:15:42 -07:00
75e4c53b35 [rpc] Add a debug only check to debug python cleanup races (#35395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35395

as title
ghstack-source-id: 101035263

Test Plan: CI

Differential Revision: D20632634

fbshipit-source-id: 737e353982b325e73da3825b130aae6b11dbcfe7
2020-03-27 18:53:35 -07:00
43928effee [jit] Remove _assert_int_or_pair op (#34509)
Summary:
This one doesn't actually do anything, so we don't need an op for it.

It is used inside `torch.nn.functional.unfold`, which is already tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34509

Pulled By: driazati

Differential Revision: D20676445

fbshipit-source-id: b72d1308bdec593367ec4e14bf9a901d0b62e1cc
2020-03-27 18:37:49 -07:00
a9b540d109 Revert D20670031: [pytorch][PR] Fix some incorrect annotations found by clang-cl
Test Plan: revert-hammer

Differential Revision:
D20670031

Original commit changeset: cd8018dee703

fbshipit-source-id: 6900bf46346f0f415812607e5eff67259fc7b478
2020-03-27 18:26:01 -07:00
238903b7be [jit] Delete polyfill typing (#27510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27510

We could delete the polyfill typing because requirements.txt requires users to
install typing as a dependency whether on py2 or py3, so the polyfill typing
wasn't actually getting used either way.

Test Plan: Imported from OSS

Differential Revision: D20673393

fbshipit-source-id: ea5276824c6e275c1f991f8c12329040b0058d2b
2020-03-27 18:20:53 -07:00
6d13ef719e Update warning message for autograd issue + XLA backend (#35543)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35543

Differential Revision: D20704554

Pulled By: ailzhang

fbshipit-source-id: d492f0510b74b3b44bc369c08c32d4b5afc4de7f
2020-03-27 18:16:10 -07:00
76a8d30693 [quant][graphmode] Fold quantized prepacking ops (#35077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35077

Fold the prepack ops: `quantized::linear_prepack` and `quantized::conv2d_prepack` after
`freeze`

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20655301

fbshipit-source-id: fbb4223323f788c88db7b55cfafda46fad106d49
2020-03-27 17:51:51 -07:00
f27403d761 [jit] Fix named tuple resolution (#35409)
Summary:
Fixes #29035

Previously we were missing a case for namedtuples in our Python value resolution logic, so they were just getting resolved as regular Python values, hence the `OSError`s in the linked issue.
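
An illustrative repro of the kind of code this fixes (hypothetical, not taken from the issue):

```python
from typing import NamedTuple
import torch

class Point(NamedTuple):
    x: int
    y: int

@torch.jit.script
def make_point() -> Point:
    # Before this fix, Point resolved as a plain Python value and scripting failed.
    return Point(1, 2)

assert make_point() == Point(1, 2)
```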
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35409

Pulled By: driazati

Differential Revision: D20653496

fbshipit-source-id: b5db1a11e918175aa02fda92993d233695417c56
2020-03-27 17:07:26 -07:00
cfcb63de34 custom class method holder should hold a unique_ptr (#35218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35218

We should express the ownership semantics directly here. Using
`shared_ptr` makes it too easy to leak ownership by inadvertently
storing a copy.

Test Plan: Imported from OSS

Differential Revision: D20682673

Pulled By: suo

fbshipit-source-id: 32002ee515eb8bb7b37e6d0aac3c0695df4eec79
2020-03-27 16:58:40 -07:00
b9adbb5002 Fix/relax CMake linter rules (#35574)
Summary:
Ignore mixed upper-case/lower-case style for now.
Fix the space-between-function-name-and-arguments violations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35574

Test Plan: CI

Differential Revision: D20712969

Pulled By: malfet

fbshipit-source-id: 0012d430aed916b4518599a0b535e82d15721f78
2020-03-27 16:52:33 -07:00
96eec95ece torch.from_numpy for complex dtypes (#35531)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35531
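
A brief usage sketch of what this enables (illustrative):

```python
import numpy as np
import torch

a = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)
t = torch.from_numpy(a)          # shares memory with `a`, as usual
assert t.dtype == torch.complex64
```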

Differential Revision: D20693581

Pulled By: anjali411

fbshipit-source-id: d53e26b4175452fa00b287efbfceea18104c1364
2020-03-27 14:40:28 -07:00
f101949390 Remove python2 support from setup.py (#35539)
Summary:
As a followup to https://github.com/pytorch/pytorch/pull/35042 this removes python2 from setup.py and adds Python 3.8 to the list of supported versions. We're already testing this in CircleCI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35539

Differential Revision: D20709060

Pulled By: orionr

fbshipit-source-id: 5d40bc14cb885374fec370fc7c5d3cde8769039a
2020-03-27 14:33:11 -07:00
45c9ed825a Formatting cmake (to lowercase without space for if/elseif/else/endif) (#35521)
Summary:
Running commands:
```bash
shopt -s globstar

sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i caffe2/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i torch/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i c10/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake.in
```
We may further convert all the commands into lowercase, following 77543bde41.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35521

Differential Revision: D20704382

Pulled By: malfet

fbshipit-source-id: 42186b9b1660c34428ab7ceb8d3f7a0ced5d2e80
2020-03-27 14:25:17 -07:00
04a3345335 [quant] Make conv2d_prepack and linear_prepack pure (#35073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35073

We want to do constant propagation for quantize_per_tensor/quantize_per_channel,
which produce results that are consumed by these ops, and since we need to
make sure the output of a node has no writers before constant-propagating through it,
the consumer needs to be pure as well.

Test Plan:
see next PR

Imported from OSS

Differential Revision: D20655310

fbshipit-source-id: 3e33662224c21b889c8121b823f8ce0b7da75eed
2020-03-27 14:19:32 -07:00
e1773f2ac0 .circleci: Change default CUDA for pip, cu101 -> cu102 (#35309)
Summary:
So that packages are correctly marked when looking through the html
pages.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35309

Differential Revision: D20626737

Pulled By: seemethere

fbshipit-source-id: 0fad3d99f0b0086898939fde94ddbbc9861d257e
2020-03-27 14:13:37 -07:00
02d6e6e55f histc: Add a note on elements outside of given bounds (#34889)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34889

Differential Revision: D20625916

Pulled By: albanD

fbshipit-source-id: febb769f40d86bae8e1c7bb51d719b92bf4a572d
2020-03-27 14:04:51 -07:00
4529d03971 Move test_libtorch from win-test2 to win-test1 group (#35540)
Summary:
Let's see if it makes both test branches a bit more balanced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35540

Test Plan: CI

Differential Revision: D20704642

Pulled By: malfet

fbshipit-source-id: 4e2ab5a80adfe78620206d4eaea30207194379cc
2020-03-27 13:10:53 -07:00
ef511d884b Calls to _empty_affine_quantized pass MemoryFormat by TensorOptions (#34248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34248

This argument will no longer exist in positional form when MemoryFormat
is moved into TensorOptions by codegen, so we must stop using it when
we make calls from C++.  This diff eliminates all direct positional
calls, making them be passed in using TensorOptions.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20683398

Pulled By: bhosmer

fbshipit-source-id: 6928cfca67abb22fbc667ecc2af8453d93489bd6
2020-03-27 13:02:13 -07:00
05e973d673 Add WorkerInfo through TorchBind to make it an available type in TorchScript (#35447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35447

as titled

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script
```

Differential Revision: D7923053

fbshipit-source-id: 7b80e0b28aa66343249b8af328ba251314674dcc
2020-03-27 12:41:28 -07:00
835ee34e38 [ROCm] Update to ROCm 3.1.1 (#35552)
Summary:
Redux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35552

Differential Revision: D20701593

Pulled By: ezyang

fbshipit-source-id: 1946d1e8fb47d597da903bae5d355bf52a5f017f
2020-03-27 12:21:12 -07:00
ff71a4192d Bump base version to 1.6.0a0 (#35495)
Summary:
Since we've done the branch cut for 1.5.0 we should bump nightlies to 1.6.0

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35495

Differential Revision: D20697043

Pulled By: seemethere

fbshipit-source-id: 3646187a5e729994138bf2c68625f25f11430b3a
2020-03-27 12:14:49 -07:00
9e22d15f14 Enable tensorexpr cpp tests in CI. try #2 (#35454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35454

Differential Revision: D20665160

Pulled By: Krovatkin

fbshipit-source-id: e04cbe92b2ee5a3288f3c4e5c83533bfea85bf85
2020-03-27 12:09:55 -07:00
930d218fbf Increase Channels Last test coverage (#35504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35504

Test Plan: Imported from OSS

Differential Revision: D20682117

Pulled By: VitalyFedyunin

fbshipit-source-id: ddd7ef1f075ea2c5c35df7bd698974fc5c59bc40
2020-03-27 12:04:47 -07:00
3af46c90bd [caffe2] Header path in byte_order.h (#35519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35519

Fix include of THHalf.h to be TH/THHalf.h. Makes the include consistent with the rest of caffe2.

Test Plan: CI

Differential Revision: D20685997

fbshipit-source-id: 893b6e96e4f1a1e7306ba2e40e4e8ee738f0344f
2020-03-27 11:57:21 -07:00
2c300df2ac [fix] at::print for quantized Tensor (#35545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35545

Looks like we have never printed a quantized Tensor in cpp before.

(Note: this ignores all push blocking failures!)

Test Plan:
.

Imported from OSS

Differential Revision: D20699748

fbshipit-source-id: 9d029815c6e75f626afabf92194154efc83f5545
2020-03-27 11:15:28 -07:00
3cc43bcbb5 Skip slow quantized tests under ASAN (#35533)
Summary:
Skip tests that normally finish in under a second but take 20+ min under ASAN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35533

Test Plan: CI

Differential Revision: D20700245

Pulled By: malfet

fbshipit-source-id: 7620b12d3aba1bafb2baa9073fa27c4a0b3dd9eb
2020-03-27 10:55:14 -07:00
0c16cedafe Fix some incorrect annotations found by clang-cl (#35364)
Summary:
Fixes incorrect usages of symbol annotations including:
1. Exporting or importing a function/class in an anonymous namespace.
2. Exporting or importing a function/class implementation in a header file. However, by removing the symbol annotations, they are now local symbols. If they need to remain global, I can move the implementations to the source file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35364

Differential Revision: D20670031

Pulled By: ezyang

fbshipit-source-id: cd8018dee703e2424482c27fe9608e040d8105b8
2020-03-27 10:40:04 -07:00
b33e38ec47 Allow a higher-precision step type for Vec256::arange (#34555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34555

This is sometimes necessary, such as when T=int and the step size is of
type double.

Test Plan: Imported from OSS

Differential Revision: D20687063

Pulled By: ezyang

fbshipit-source-id: 33086d4252d06e7539733a9b1b3d6774e177b6da
2020-03-27 10:22:05 -07:00
5a02930d3a Vectorize (CPU) generic types for binary bitwise operators (#34338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34338

For those types not optimized for AVX2, this commit would give bitwise
operations on them a boost.

Benchmark (RHEL 7.7, Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, Turbo off, Release build):

```python
import timeit
for op in ('bitwise_and', 'bitwise_or', 'bitwise_xor'):
    for dtype in ('torch.int8', 'torch.uint8'):
        for n, t in [(10_000, 200000),
                    (100_000, 20000)]:
            print(f'a.{op}_(b), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(f'a.{op}_(b)', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```

Before:

```
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.353799690001324
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.056434961999912
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.2957618809996347
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.0591609650000464
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.3113185389993305
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.0693870880022587
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.3075691039994126
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.0589785859992844
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.3036618039986934
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.0595013140009542
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.2947387999993225
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.059969027999614
```

After:

```
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.9562859639991075
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.6811799210008758
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
0.9522694869992847
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.6815469840003061
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.8609786279994296
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.5794818879985542
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
0.8534434389985108
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.5764101290005783
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.9634105910008657
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.6819724230008433
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.0901075929978106
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.816546294001455
```

Test Plan: Imported from OSS

Differential Revision: D20687081

Pulled By: ezyang

fbshipit-source-id: 59b06460430ce181fb761e45a5bdd6379611b391
2020-03-27 10:15:53 -07:00
3c02de0011 copy_ fixed on cuda so removing the workaround in test_many_promotions (#35528)
Summary:
The copy_() launch failure on CUDA for complex (https://github.com/pytorch/pytorch/issues/35344) has been fixed, so this removes the workaround added in PR https://github.com/pytorch/pytorch/issues/34093.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35528

Differential Revision: D20693228

Pulled By: anjali411

fbshipit-source-id: dbb6369aa5a21574a0a4fe878ca10e4ecc605f6b
2020-03-27 09:39:46 -07:00
77ad3c5aeb Revert D20683972: [pytorch][PR] Fix PyTorch separate compilation
Test Plan: revert-hammer

Differential Revision:
D20683972

Original commit changeset: bc1492aa9d1d

fbshipit-source-id: 8994cbb36877d4338b8677ac6bc807dd16efa67c
2020-03-27 09:18:48 -07:00
16394a9d3f [caffe2] early return for empty indices in SLS (#35498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35498

As title

Test Plan:
Need to run remote predictor canary

In SKL T6,

numactl -m 0 -C 3 ./sparse_lengths_sum_benchmark.par -d float -e 100000 --embedding-dim 1 --average-len 0 --batch-size 16 -i 1000000

Before this diff
    0.000302733 ms.        100%. SparseLengthsSum

After this diff
    0.000214509 ms.        100%. SparseLengthsSum

Reviewed By: jianyuh, ellie-wen

Differential Revision: D20678075

fbshipit-source-id: c0c8359036b82ffcbcc8b2a89dfb62db7f0a9c14
2020-03-27 09:10:45 -07:00
25fe7f33ce Add cmakelint to CI (#35525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35525

Differential Revision: D20696655

Pulled By: malfet

fbshipit-source-id: 1b15cd730066c8a80440b39110f7f0d51f8ebad0
2020-03-27 09:04:36 -07:00
58f5a89c9a Refactor RoIAlignOp on CPU (#34698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34698

Refactor RoIAlignOp on CPU

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:roi_align_rotated_op_test

Reviewed By: houseroad

Differential Revision: D20432434

fbshipit-source-id: 9125eb3bdc83c734222d7d4947c175e3b585afa7
2020-03-27 07:53:58 -07:00
2d023fe6a7 [7] add missing roi_align_rotated op to lite interpreter (#35244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35244

add roi_align_rotated op to lite interpreter for detectron2go model

(Note: this ignores all push blocking failures!)

Test Plan: try to run model in https://home.fburl.com/~stzpz/text_det/fbnet_300_20/

Reviewed By: iseeyuan

Differential Revision: D20560485

fbshipit-source-id: a81f3a590b9cc5a02d4da676b3cfa52b0e0a68c3
2020-03-27 07:26:02 -07:00
181da12126 Revert D20687652: [pytorch][PR] Report results from cpp unittests on Windows and Linux
Test Plan: revert-hammer

Differential Revision:
D20687652

Original commit changeset: fc370b7e2614

fbshipit-source-id: 8153815c8ed8f3d4f472caa95eda76180b038a42
2020-03-27 06:56:53 -07:00
45e1be9762 Revert D19710370: [pytorch][PR] ONNX Update training ops and training amenable export API
Test Plan: revert-hammer

Differential Revision:
D19710370

Original commit changeset: e5e79d385529

fbshipit-source-id: d0114dc561a3415869805d3fbf43b92730bbcf54
2020-03-27 06:51:05 -07:00
e5cd17cc9e [4] register quantized ops for lite interpreter (#35247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35247

add a leading "_" to register quantized ops for lite interpreter. They are needed by d2go model

(Note: this ignores all push blocking failures!)

Test Plan:
(whole stack)
buck build -c user.ndk_cxxflags='-g1' -c caffe2.expose_op_to_c10=1 //xplat/caffe2/fb/pytorch_predictor:maskrcnnAndroid#android-armv7

Reviewed By: iseeyuan

Differential Revision: D20528760

fbshipit-source-id: 5b26d075456641b02d82f15a2d19f2266001f23b
2020-03-27 02:26:03 -07:00
025a0abe5a ONNX Update training ops and training amenable export API (#32950)
Summary:
- Update Dropout and Batchnorm in opset 12 : https://github.com/onnx/onnx/pull/2568
- Update api logic for exporting to ONNX training amenable models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32950

Reviewed By: hl475

Differential Revision: D19710370

Pulled By: houseroad

fbshipit-source-id: e5e79d38552936966662c41d39ddf33be1ba3e35
2020-03-27 00:39:39 -07:00
ac639d927a Reland "[RPC] Use qualified name str directly in RPC torch script code path" (#35489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35489

Relanding https://github.com/pytorch/pytorch/pull/34733.

Fix is in https://github.com/pytorch/pytorch/pull/34988

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D20661748

fbshipit-source-id: d550daab8d689d0a9aa2450f3bdb7417ab79dae2
2020-03-26 23:41:51 -07:00
d2d40c45b6 Report results from cpp unittests on Windows and Linux (#35500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35500

Test Plan:
Test in production :)
Results should eventually be published to: https://circleci.com/build-insights/gh/pytorch/pytorch/master

Differential Revision: D20687652

Pulled By: malfet

fbshipit-source-id: fc370b7e261402e14b427f42038ecb2d95bad059
2020-03-26 23:00:33 -07:00
da4e68faed Make operator names consistent between export_opnames and the lite interpreter (#34674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34674

Two changes to make sure the op_names dumped in export_opnames() are consistent with what is actually used in bytecode.
* Inline the graph before dumping the operator names.
* Use the code of the graph (which is used in bytecode) instead of the nodes of the graph.
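
For context, a minimal usage sketch of the API whose output this makes consistent (illustrative):

```python
import torch

m = torch.jit.script(torch.nn.ReLU())
ops = torch.jit.export_opnames(m)
print(ops)  # e.g. ['aten::relu'] -- the root ops the bytecode actually needs
```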

Test Plan: Imported from OSS

Differential Revision: D20610715

Pulled By: iseeyuan

fbshipit-source-id: 53fa9c3b36f4f242b7f2b99b421f4adf20d4b1f6
2020-03-26 22:50:59 -07:00
8c90ae11b3 [JIT] fix glow subgraph inputs ordering (#35508)
Summary:
My PR https://github.com/pytorch/pytorch/pull/33020 changed subgraph_utils and made it non-deterministic by using a set instead of a vector for closed-over values. This broke a downstream glow test. We're in the process of working with glow to not rely on the subgraph input order, but in the interim, make it ordered again to fix the test.

An alternative is to use a `set` instead of a vector, but I don't particularly like committing to fixed ordering for the subgraph, especially for things like if nodes and while loops where an order doesn't really have any meaning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35508

Differential Revision: D20683959

Pulled By: eellison

fbshipit-source-id: bb39b29fef2904e52b9dc42be194bb57cbea59c4
2020-03-26 22:44:54 -07:00
bd604cb5b7 Upgrade MKL-DNN to DNNL v1.2 (#32422)
Summary:
## Motivation

This PR upgrades MKL-DNN from v0.20 to DNNL v1.2 and resolves https://github.com/pytorch/pytorch/issues/30300.

DNNL (Deep Neural Network Library) is the new brand of MKL-DNN, which improves performance, quality, and usability over the old version.

This PR focuses on the migration of all existing functionalities, including minor fixes, performance improvement and code clean up. It serves as the cornerstone of our future efforts to accommodate new features like OpenCL support, BF16 training, INT8 inference, etc. and to let the Pytorch community derive more benefits from the Intel Architecture.

<br>

## What's included?

Even though DNNL has many breaking changes to the API, we managed to absorb most of them in ideep. This PR contains minimalist changes to the integration code in pytorch. Below is a summary of the changes:

<br>

**General:**

1. Replace op-level allocator with global-registered allocator

```
// before
ideep::sum::compute<AllocForMKLDNN>(scales, {x, y}, z);

// after
ideep::sum::compute(scales, {x, y}, z);
```

The allocator is now being registered in `aten/src/ATen/native/mkldnn/IDeepRegistration.cpp`. Thereafter all tensors derived from the `cpu_engine` (by default) will use the c10 allocator.

```
RegisterEngineAllocator cpu_alloc(
  ideep::engine::cpu_engine(),
  [](size_t size) {
    return c10::GetAllocator(c10::DeviceType::CPU)->raw_allocate(size);
  },
  [](void* p) {
    c10::GetAllocator(c10::DeviceType::CPU)->raw_deallocate(p);
  }
);
```
------

2. Simplify group convolution

We had a scenario in convolution where the ideep tensor shape mismatched the aten tensor: when `groups > 1`, DNNL expects weight tensors to be 5-d with an extra group dimension, e.g. `goihw` instead of `oihw` in the 2d conv case.

As shown below, a lot of extra checks came with this difference in shape before. Now we've completely hidden this difference in ideep and all tensors are going to align with pytorch's definition. So we could safely remove these checks from both aten and c2 integration code.

```
// aten/src/ATen/native/mkldnn/Conv.cpp

if (w.ndims() == x.ndims() + 1) {
  AT_ASSERTM(
      groups > 1,
      "Only group _mkldnn_conv2d weights could have been reordered to 5d");
  kernel_size[0] = w.get_dim(0) * w.get_dim(1);
  std::copy_n(
      w.get_dims().cbegin() + 2, x.ndims() - 1, kernel_size.begin() + 1);
} else {
  std::copy_n(w.get_dims().cbegin(), x.ndims(), kernel_size.begin());
}
```

------

3. Enable DNNL built-in cache

Previously, we stored DNNL jitted kernels along with intermediate buffers inside ideep using an LRU cache. Now we are switching to the newly added DNNL built-in cache, and **no longer** caching buffers in order to reduce memory footprint.

This change will be mainly reflected in lower memory usage in memory profiling results. On the code side, we removed a couple of lines of `op_key_` that depended on the ideep cache before.

------

4. Use 64-bit integer to denote dimensions

We changed the type of `ideep::dims` from `vector<int32_t>` to `vector<int64_t>`. This renders ideep dims no longer compatible with the 32-bit dims used by caffe2, so we use something like `{stride_.begin(), stride_.end()}` to cast the parameter `stride_` into an int64 vector.

<br>

**Misc changes in each commit:**

**Commit:** change build options

Some build options were slightly changed, mainly to avoid name collisions with other projects that include DNNL as a subproject. In addition, DNNL built-in cache is enabled by option `DNNL_ENABLE_PRIMITIVE_CACHE`.

Old | New
-- | --
WITH_EXAMPLE | MKLDNN_BUILD_EXAMPLES
WITH_TEST | MKLDNN_BUILD_TESTS
MKLDNN_THREADING | MKLDNN_CPU_RUNTIME
MKLDNN_USE_MKL | N/A (not use MKL anymore)

------

**Commit:** aten reintegration

- aten/src/ATen/native/mkldnn/BinaryOps.cpp

    Implement binary ops using new operation `binary` provided by DNNL

- aten/src/ATen/native/mkldnn/Conv.cpp

    Clean up group convolution checks
    Simplify conv backward integration

- aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp

    Simplify prepacking convolution weights

- test/test_mkldnn.py

    Fixed an issue in the conv2d unit test: it didn't actually check conv results between the mkldnn and aten implementations before; instead, it compared mkldnn against mkldnn, since the default cpu path also dispatches to mkldnn. Now we use `torch.backends.mkldnn.flags` to disable mkldnn for the reference run (see the sketch after this list)

- torch/utils/mkldnn.py

    Prepack the weight tensor in the module's `__init__` to achieve significantly better performance
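A minimal sketch of the corrected test comparison described above (hypothetical shapes; a bare `assert` stands in for the unit-test assertion):

```
import copy
import torch
from torch.utils import mkldnn as mkldnn_utils

x = torch.randn(1, 3, 8, 8)
conv = torch.nn.Conv2d(3, 3, kernel_size=3).eval()

with torch.backends.mkldnn.flags(enabled=False):
    y_ref = conv(x)  # reference now truly runs the plain aten CPU kernel

conv_mkldnn = mkldnn_utils.to_mkldnn(copy.deepcopy(conv))
y_mkldnn = conv_mkldnn(x.to_mkldnn()).to_dense()

assert torch.allclose(y_ref, y_mkldnn, atol=1e-5)
```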

------

**Commit:** caffe2 reintegration

- caffe2/ideep/ideep_utils.h

    Clean up unused type definitions

- caffe2/ideep/operators/adam_op.cc & caffe2/ideep/operators/momentum_sgd_op.cc

    Unify tensor initialization with `ideep::tensor::init`, obsoleting `ideep::tensor::reinit`

- caffe2/ideep/operators/conv_op.cc & caffe2/ideep/operators/quantization/int8_conv_op.cc

    Clean up group convolution checks
    Revamp convolution API

- caffe2/ideep/operators/conv_transpose_op.cc

    Clean up group convolution checks
    Clean up deconv workaround code

------

**Commit:** custom allocator

- Register c10 allocator as mentioned above

<br><br>

## Performance

We tested inference on some common models based on user scenarios; most performance numbers are better than or on par with those of the previous DNNL 0.20 integration.

Ratio: new / old | Latency (batch=1, 4 threads) | Throughput (batch=64, 56 threads)
-- | -- | --
pytorch resnet18 | 121.4% | 99.7%
pytorch resnet50 | 123.1% | 106.9%
pytorch resnext101_32x8d | 116.3% | 100.1%
pytorch resnext50_32x4d | 141.9% | 104.4%
pytorch mobilenet_v2 | 163.0% | 105.8%
caffe2 alexnet | 303.0% | 99.2%
caffe2 googlenet-v3 | 101.1% | 99.2%
caffe2 inception-v1 | 102.2% | 101.7%
caffe2 mobilenet-v1 | 356.1% | 253.7%
caffe2 resnet101 | 100.4% | 99.8%
caffe2 resnet152 | 99.8% | 99.8%
caffe2 shufflenet | 141.1% | 69.0% †
caffe2 squeezenet | 98.5% | 99.2%
caffe2 vgg16 | 136.8% | 100.6%
caffe2 googlenet-v3 int8 | 100.0% | 100.7%
caffe2 mobilenet-v1 int8 | 779.2% | 943.0%
caffe2 resnet50 int8 | 99.5% | 95.5%

_Configuration:
Platform: Skylake 8180
Latency Test: 4 threads, warmup 30, iteration 500, batch size 1
Throughput Test: 56 threads, warmup 30, iteration 200, batch size 64_

† Shufflenet is one of the few models that require temporary buffers during inference. The performance degradation is expected, since we no longer cache any buffers in ideep. As a remedy, we suggest users opt for a caching allocator such as **jemalloc** as a drop-in replacement for the system allocator in such heavy workloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32422

Test Plan:
Perf results: https://our.intern.facebook.com/intern/fblearner/details/177790608?tab=Experiment%20Results

10% improvement for ResNext with avx512, neutral on avx2

More results: https://fb.quip.com/ob10AL0bCDXW#NNNACAUoHJP

Reviewed By: yinghai

Differential Revision: D20381325

Pulled By: dzhulgakov

fbshipit-source-id: 803b906fd89ed8b723c5fcab55039efe3e4bcb77
2020-03-26 22:07:59 -07:00
8240db11e1 [pytorch] Remove python2 support from tests and torch.jit (#35042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35042

Removing python2 tests and some compat code in torch.jit. Check if dependent projects and external tests have any issues after these changes.

Test Plan: waitforsandcastle

Reviewed By: suo, seemethere

Differential Revision: D18942633

fbshipit-source-id: d76cc41ff20bee147dd8d44d70563c10d8a95a35
2020-03-26 21:29:51 -07:00
98362d11ff [rpc] create error string in listenLoop outside of lock (#35393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35393

this was being created inside the lock scope, but we don't need to
hold the lock for this.
ghstack-source-id: 100953426

Test Plan: CI

Differential Revision: D20632225

fbshipit-source-id: dbf6746f638b7df5fefd9bbfceaa6b1a542580e2
2020-03-26 20:57:01 -07:00
77bbbf042d [JIT]Support converting str to float. (#35352)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35352

Differential Revision: D20649286

Pulled By: ailzhang

fbshipit-source-id: e9b09bddd0fe3c962a7514d45fd069cd0b4e6df1
2020-03-26 20:24:59 -07:00
00a261fddd [pytorch] add fallthrough variable kernel for C10_MOBILE (#35491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35491

The goal of this diff is to avoid having to set the AutoNonVariableTypeMode guard
in client code that uses a custom mobile build. The guard was necessary because
a custom mobile build might not include variable kernels, in which case the
AutoNonVariableTypeMode guard is usually set. It's hard to enforce this rule at
all call sites, so we make this change to simplify things.
Another goal of the diff is to not break FL, where real variable kernels are
registered.
ghstack-source-id: 100944553

Test Plan:
- With stacked diff, tested lite-trainer with MnistModel:
```
buck run xplat/caffe2/fb/lite_trainer:lite_trainer \
-c pt.disable_gen_tracing=1 \
-- --model=/home/liujiakai/ptmodels/MnistModel.bc
```
- Will test with the papaya sample app.

Differential Revision: D20643627

fbshipit-source-id: 37ea937919259c183809c2b7acab0741eff84d33
2020-03-26 20:08:05 -07:00
f5383a213f Fix openmp detection with clang-cl (#35365)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35365

Differential Revision: D20653049

Pulled By: ezyang

fbshipit-source-id: 193c0d956b1aea72b3daa104ef49c4bf167a165a
2020-03-26 19:59:53 -07:00
5371fdb1a0 [C++ API Parity] [Optimizers] Merged Optimizer and LossClosureOptimizer (#34957)
Summary:
1. Removed LossClosureOptimizer, and merged Optimizer into OptimizerBase (and renamed the merged class to Optimizer)
2. Merged the LBFGS-specific serialize test function and the generic test_serialize_optimizer function.
3. BC-compatibility serialization test for LBFGS
4. Removed mentions of parameters_ in optimizer.cpp, de-virtualized all functions
5. Made defaults_ an optional argument in all optimizers except SGD

**TODO**: add BC-breaking notes for this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34957

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20678162

Pulled By: yf225

fbshipit-source-id: 74e062e42d86dc118f0fbaddd794e438b2eaf35a
2020-03-26 19:53:02 -07:00
e68afe3ab9 [JIT] remove prim::shape op (#34286)
Summary:
Desugar prim::shape to aten::size so that passes don't need to reason about both ops. Serialized models still resolve to `prim::shape` so this doesn't break BC.
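A minimal sketch of the effect (hedged; graph contents can vary between releases): both `tensor.shape` and `tensor.size()` should now lower to `aten::size` in a scripted graph, so passes only need to handle one op.

```
import torch

@torch.jit.script
def f(x):
    return x.shape[0] + x.size(0)

print(f.graph)  # expect aten::size (not prim::shape) for both accesses
```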
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34286

Differential Revision: D20316818

Pulled By: eellison

fbshipit-source-id: d1585687212843f51e9396e07c108f5c08017818
2020-03-26 19:29:25 -07:00
8f18cdf2b8 [Autograd Testing] Few refactors to test_autograd.py (#35443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35443

Addressing Wanchao's comments from
https://github.com/pytorch/pytorch/pull/35268.
ghstack-source-id: 100944390

Test Plan: waitforbuildbot

Differential Revision: D20662292

fbshipit-source-id: d98bf27106e858fe81e0f7755639c7da0f322913
2020-03-26 18:57:52 -07:00
5d9694250c Updating submodules
Summary:
GitHub commits:

6a867586ed
bf0ba207b5
b90f25fcfe
ea2ad0ad00
f32a0cc4a7
23826a3f97
6301dbe7a7
3332b50f59
b6cf025c4f
683abef629
099bb93f87
10214d1d1b
5b848ab61d
a6e81fb889

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: cfd395231e68b7d026fce966bcb8cddf10996770
2020-03-26 18:51:35 -07:00
9970be2fd2 Update git-pre-commit (#35511)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35511

Differential Revision: D20684849

Pulled By: suo

fbshipit-source-id: e059e15230d1a4064f45df5c7895b220c9cc20d9
2020-03-26 18:45:33 -07:00
9b4bbaab53 Add RRef.local_value() for TorchScript (#35433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35433

Make RRef TorchScript API the same as RRef Python API.

Differential Revision: D7923050

fbshipit-source-id: 62589a429bcaa834b55db6ae8cfb10c0a2ee01ff
2020-03-26 18:06:13 -07:00
d4f3bc7f8e [dt] [caffe2] add/fix shape inference for StumpFunc, SliceGradient and ResizeLike (#35430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35430

This fixes and adds tests for several commonly used operators.

There's some formatting differences due to running clang-format on one of the files.

Test Plan: buck test //caffe2/caffe2/fb/operators:hypothesis_test //caffe2/caffe2/python/operator_test:utility_ops_test //caffe2/caffe2/python/operator_test:concat_split_op_test

Reviewed By: yyetim

Differential Revision: D20657405

fbshipit-source-id: 51d86d0834003b8ac8d6acb5149ae13d7bbfc6ab
2020-03-26 17:50:32 -07:00
2e739f822b Fix PyTorch separate compilation (#34863)
Summary:
Looks like there is a bug in the CUDA device linker whereby kernels that use `thrust::sort_by_key` cannot be linked with other kernels.
Solve the problem by splitting 5 thrust-heavy .cu files into a `__torch_cuda_sp` library which is statically linked into `torch_cuda`.
For the default compilation workflow it should not make any difference.

Test Plan: Compile with `-DCUDA_SEPARABLE_COMPILATION=YES` and observe the library size difference: 310Mb before, 173Mb after, when compiled for sm_75
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34863

Differential Revision: D20683972

Pulled By: malfet

fbshipit-source-id: bc1492aa9d1d2d21c48e8764a8a7b403feaec5da
2020-03-26 17:49:07 -07:00
2f6f1781af Add warning to a known autograd issue on XLA backend. (#35449)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35449

Differential Revision: D20676835

Pulled By: ailzhang

fbshipit-source-id: c351eb5650ff09654f7c2e3588dfea19dcde3856
2020-03-26 17:44:12 -07:00
8074779328 [quant][graph] Update dynamic quant tests to use new qconfig (#35451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35451

default_dynamic_qconfig now holds activation observer

Test Plan:
python test/test_quantize_script.py

Imported from OSS

Differential Revision: D20664585

fbshipit-source-id: 78cb6747705d230d2bbcfdae59210b4b998d0d15
2020-03-26 17:39:49 -07:00
daba68c601 [quant][graph] Add a new observer type for dynamic quantization (#35455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35455

In graph mode we need to observe the activation tensor for dynamic quantization. This observer should behave the same way as the quantization functions called in the dynamic operator.
Currently for qlinear_dynamic we call quant_utils::ChooseQuantizationParams which has its own logic for calculating scale and zero_point.
We mimic those calculations in the new observer.

Test Plan:
python test/test_quantization.py ObserverTest

Imported from OSS

Differential Revision: D20664586

fbshipit-source-id: e987ea71fff777c21e00c498504e6586e92568a2
2020-03-26 17:38:21 -07:00
086dba3804 [caffe2] move fused SparseAdagrad to open source (#35164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35164

As title

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20581853

fbshipit-source-id: 393ddd9487cd965c465eaa49e1509863618a6048
2020-03-26 17:29:12 -07:00
91e4685514 [modules][caffe2/aten] Fix #include inside of namespace error (#35302)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35302

This is an error in modular builds.

Test Plan: CI

Reviewed By: igorsugak

Differential Revision: D20591224

fbshipit-source-id: 44e8e1be9e54b94f7b54be6bdeb4260a763667ce
2020-03-26 17:17:57 -07:00
618104185b [autograd] enable graph level thread parallelism on CPU (#33157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33157

This PR enables graph level thread parallelism on CPU for the Autograd
Engine. It replace https://github.com/pytorch/pytorch/pull/29574 for the
reason of task level parallelism drawbacks with the existing autograd
system.

Fixes https://github.com/pytorch/pytorch/issues/18333

The graph level parallelism on CPU design:

1. Remove the single CPU thread that was initialized in the Engine itself, and allow the owning thread (the one that calls Engine::execute) to drive the Engine execution, so that outer threading can provide thread parallelism.
2. Maintain a separate ReadyQueue per CPU thread, and stash the ReadyQueues for the different devices/threads into thread-local shared_ptrs; the Engine itself memorizes the shared_ptrs of the ReadyQueues for the non-CPU devices.
3. The CPU thread-local ReadyQueue is initialized per CPU-thread Engine::execute call (i.e. per `backward()` or `grad()` call), and its shared_ptr is memorized in the GraphTask, since every `backward()` call has its own GraphTask.
4. Cross-device NodeTask pushes are accomplished by 2 and 3: we can refer to a device's ReadyQueue from the Engine, and to the CPU's ReadyQueue from the GraphTask, which means we can push to a different ReadyQueue according to the device.
5. Termination of the CPU thread: once we mark the graph_task as completed, we exit the while loop and terminate the current backward execution, because it is guaranteed that all other NodeTasks are finished before we mark a GraphTask as complete.
6. The re-entrant thread logic stays the same; reentrant thread detection is similar to before: we set the worker_device to NO_DEVICE initially and set it to CPU afterward to detect whether this is a reentrant call.
7. We still have the reentrant thread pool that creates new threads for deep reentrant cases, reusing the parent thread's ReadyQueue for performance.

Since we introduce thread parallelism on CPU, we have to ensure the
thread safety of the GraphTask. This is not a problem if we execute all
forwards in different threads, since we will build a separate GraphTask in
each thread, and each GraphTask is a separate instance that shares
nothing; i.e. Hogwild training on CPU should be fine in this case.

But there might be cases where the user would like to do some part of the
task in a single thread, and do the rest of the work in several threads
concurrently, so thread safety is crucial there. The thread
safety strategy for the multithreaded autograd is as follows:

1. Add a mutex to protect thread safety in the Autograd Node/Function, and
   hold the lock for the different data-racing cases.
2. Lock the mutex during Node::apply(); this ensures that Nodes writing to
   shared variables are not racing across threads (i.e.
   AccumulateGrad, and custom C++ Autograd Nodes that write to shared
   variables).
3. Lock the mutex during Node::release_variables(); this serves the
   purpose that when we release saved_variables from one thread, no
   other thread can call Node::apply(), which ensures that the variable
   references from other threads aren't dangling.
4. If we don't release any variables and there is no shared data read/write in
   the Node, i.e. it is purely functional, we don't lock the mutex.

This way we can protect thread safety on the Autograd Node, but we
still cannot protect thread safety on Node pre/post C++ hooks
(Python hooks are automatically thread safe); we rely on users to
write thread-safe C++ hooks if they want the hooks to be correctly
applied in a multithreaded environment.

**User visible changes**:
There are not many user-visible changes. Since we use the owning
thread to drive the autograd execution, users can write their own
threading code that does not block on the Autograd engine. Some behaviors
users should be aware of:

**Non-determinism**:
If we call backward() on multiple threads concurrently but with
shared inputs (i.e. Hogwild CPU training): since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in race conditions, producing results that are invalid to use.

But this is the expected pattern if users drive the whole training
process with multithreading over shared parameters; users who use
multithreading should have the threading model in mind and should
expect this to happen. Users should use the functional interface
`torch.autograd.grad()` to calculate the gradients instead of calling
`backward()` on the loss.
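A minimal sketch (hypothetical shapes and worker logic) of the recommended pattern: each thread calls the functional `torch.autograd.grad()`, so gradients are returned to the caller instead of being accumulated into the shared `.grad` attributes.

```
import threading
import torch

w = torch.randn(3, 3, requires_grad=True)  # parameter shared across threads

def worker():
    x = torch.randn(4, 3)
    loss = (x @ w).sum()
    # the gradient is returned, not accumulated into w.grad
    (grad_w,) = torch.autograd.grad(loss, (w,))
    # ... consume grad_w locally (e.g. under a lock or per-thread buffer)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```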

**Graph retaining**:
If part of the autograd graph is shared between threads (i.e. we run the
first part of the forward single-threaded, then run the second part in
multiple threads), then that first part of the graph is shared. In this
case, different threads executing grad() or backward() on the same graph
might destroy the graph on the fly in one thread, causing the other
thread to crash. We will error out to the user, similar to calling
`backward()` twice without `retain_graph=True`, and let the user know they should use `retain_graph=True`.

**TODOs**:

[ ] benchmark the PR with example models and datasets to demonstrate
the performance gain in CPU training
[ ] ensure that we don't regress the single thread autograd performance

**Follow ups**:

[ ] a correct and tight integration with distributed autograd
[ ] try to unify the thread pool between JIT and Autograd, and see if
there's a unifying pattern that we could apply universally

Test Plan: Imported from OSS

Differential Revision: D20236771

Pulled By: wanchaol

fbshipit-source-id: 1e0bd4eec14ffebeffdb60b763b8d6f0e427eb64
2020-03-26 17:17:52 -07:00
9b8c9d6c72 [autograd] add tests for simple reentrant and stackoverflow escape (#35259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35259

This PR added tests as part of https://github.com/pytorch/pytorch/issues/34367

It covers:
Re-entrant -> Test simple re-entrant
Re-entrant -> Test stack overflow escape mechanism

Test Plan: Imported from OSS

Differential Revision: D20611828

Pulled By: wanchaol

fbshipit-source-id: 2c55f2a0e3244f11b7153956b0d844e1992e5c80
2020-03-26 17:16:32 -07:00
b0459fec72 [clang-format] Replace asyncio.run with approximation supported in python 3.6 (#35501)
Summary:
**Summary**
`asyncio.run` is supported only on Python 3.7 and later, and even there only provisionally.
This commit replaces the use of `asyncio.run` in `tools/clang_format.py`
with an approximation that works in both 3.6 and 3.7.
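A minimal sketch of the usual 3.6-compatible stand-in for `asyncio.run` (this is an assumed shape of the workaround, not the exact code in `tools/clang_format.py`):

```
import asyncio

async def main():
    await asyncio.sleep(0)  # placeholder for the real async formatting work
    return True

# asyncio.run(main()) requires 3.7+; this two-liner works on 3.6 as well.
loop = asyncio.get_event_loop()
ok = loop.run_until_complete(main())
print(ok)
```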

**Testing**
Ran the script with both `python3.6` and `python3.7`.

```
$ python3.6 tools/clang_format.py --diff
...
Some files not formatted correctly
$
```

```
$ python3.7 tools/clang_format.py --diff
...
Some files not formatted correctly
$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35501

Differential Revision: D20681947

Pulled By: SplitInfinity

fbshipit-source-id: 43e13aa85f79396bec1f12ee1e80eff90dbed5db
2020-03-26 17:08:46 -07:00
6b1ffcbf59 [model loading] Skip ssaRewrite for predict_net if it has been ssaRewritten (#35428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35428

ATT

Reviewed By: yinghai

Differential Revision: D20655131

fbshipit-source-id: 4089b3527fc7b83ba793f8d292c7189a0fa68361
2020-03-26 16:48:15 -07:00
b704f30189 [3] register caffe2 mask rcnn ops in lite interpreter (#35246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35246

register caffe2 mask-rcnn ops in lite interpreter. It requires a leading "_" in the name.

(Note: this ignores all push blocking failures!)

Test Plan: buck build -c caffe2.expose_op_to_c10=1 //xplat/caffe2:mask_rcnn_opsAndroid

Reviewed By: iseeyuan

Differential Revision: D20528758

fbshipit-source-id: 459668a0c6cdc6aec85cb561d7acce2a5291b421
2020-03-26 16:42:29 -07:00
c1f5a54397 Optimize index_select for 1D inputs (#35243)
Summary:
`gather` turns out to be much faster than `index_select` for this function. (Anywhere from 2-10x faster across my testing.) We do have to match the shape for the generated indices; however, this does not affect performance since `.expand` does not copy the underlying buffer.

I experimented with a custom kernel, but the improvement over this implementation didn't justify the approach since it would have added significant complexity and reduced the use of shared infrastructure in the PyTorch codebase.
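A minimal sketch (hypothetical sizes) of the equivalence being exploited: for a 1-D input, `gather` along dim 0 returns the same result as `index_select`.

```
import torch

src = torch.randn(1000)
idx = torch.randint(0, 1000, (128,))

out_sel = src.index_select(0, idx)
out_gth = src.gather(0, idx)
assert torch.equal(out_sel, out_gth)
```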
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35243

Differential Revision: D20629914

Pulled By: robieta

fbshipit-source-id: 7841b6a40ffd2b32e544f54ef2529904d76864b8
2020-03-26 16:35:11 -07:00
c8bd5ac7e9 [workflows] Don't pipe clang_format.py output to a file (#35496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35496

This commit modifies the clang-format workflow so that it prints
the output of `tools/clang_format.py` to stdout instead of piping
it to a file. This way, the issues encountered by the script
(e.g. which files are not formatted correctly) will be visible
in the CI window.

Testing:
CI

Test Plan: Imported from OSS

Differential Revision: D20678729

Pulled By: SplitInfinity

fbshipit-source-id: 8b437c2cf2779de0245c1b4301c57b4ee0dcad6d
2020-03-26 15:43:45 -07:00
ea0cab7f46 Guard listener removal add by at::Dispatcher::addListener() with mutex (#35486)
Summary:
Use std::list instead of std::vector to avoid iterating over the list of registered listeners.
Also, fix formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35486

Differential Revision: D20677764

Pulled By: malfet

fbshipit-source-id: d2a545454a29a12bbbf4aa62d9f8c4029a109e6c
2020-03-26 15:31:12 -07:00
a3e10d2a17 Expose enablement of TensorExpr fuser as env variable (#35341)
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/

```
PYTORCH_TENSOREXPR=1 python benchmark.py
```

This commit also makes the registration happen by default, removing the need to call the Python-exposed "_jit_register_tensorexpr_fuser".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341

Reviewed By: ZolotukhinM

Differential Revision: D20676348

Pulled By: bwasti

fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464
2020-03-26 14:31:57 -07:00
4d39aeec27 Revert D20653072: [pytorch][PR] Add __torch_function__ benchmarks.
Test Plan: revert-hammer

Differential Revision:
D20653072

Original commit changeset: e7e363f8a1b8

fbshipit-source-id: e75e4979399d6fee10e00a673ea45b9bcc0fd447
2020-03-26 13:36:59 -07:00
e00575044e Revert D20657271: [pytorch][PR] [JIT] Optimize before inlining
Test Plan: revert-hammer

Differential Revision:
D20657271

Original commit changeset: 7a9006858c2f

fbshipit-source-id: d77bbc74479ec8ca5d3254eff498e1cbc04add2b
2020-03-26 13:33:44 -07:00
1ff85fc08b Prefer python3 in clang_format (#35490)
Summary:
On most Linux distros `python` still points to python-2.x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35490

Differential Revision: D20676691

Pulled By: malfet

fbshipit-source-id: 0d4519b83cfebb108edc0628bf036a541247584e
2020-03-26 13:27:52 -07:00
8d720b7034 fix complex conversions on cuda (#35344)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35225.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35344

Differential Revision: D20650471

Pulled By: ngimel

fbshipit-source-id: f9edabc6dd8884f72c1a38cdf9dbe1de8362535e
2020-03-26 13:17:37 -07:00
bf24753570 Add __torch_function__ benchmarks. (#34645)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34645

Differential Revision: D20653072

Pulled By: ezyang

fbshipit-source-id: e7e363f8a1b84fc0c354586e266a695e4a2ea60e
2020-03-26 11:29:10 -07:00
61623430d3 [workflows] Add clang-format workflow (#35239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35239

This commit adds a new GitHub workflow that checks if a pull request
has any formatting issues using `tools/clang_format.py`.

Testing:
Literally in prod.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D20605802

Pulled By: SplitInfinity

fbshipit-source-id: 8dd6517dd907d7b6a3d9e9dd3969b666fbebb709
2020-03-26 11:24:56 -07:00
6384c2d81b [JIT] clang-format JIT code (#35115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115

This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.

Testing:
Ran the script, CI.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D20568523

Pulled By: SplitInfinity

fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
2020-03-26 11:24:51 -07:00
1422d2cd8b [tools] Replace clang_format.py with clang_format_new.py (#35114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35114

This commit replaces clang_format.py with clang_format_new.py, a
new and improved script that downloads, verifies and runs a platform-appropriate
clang-format binary on files in a predefined set of whitelisted directories.

Testing:
Ran the script.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D20568450

Pulled By: SplitInfinity

fbshipit-source-id: 3bd782dfc211a053c5b419fd4318d38616b5fd16
2020-03-26 11:20:05 -07:00
3b2b6ae1a8 [JIT] Optimize before inlining (#35424)
Summary:
This speeds up the inlining pass of FairSeq model from 180s -> 13s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35424

Differential Revision: D20657271

Pulled By: eellison

fbshipit-source-id: 7a9006858c2f1b157f5a3f36ed2b3774cc186de8
2020-03-26 11:08:09 -07:00
cd9a357f32 Fix non-deterministic RNG behavior in dist_optimizer tests (#35425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35425

Prior to this commit, dist_optimizer_test.py uses torch.manual_seed(0)
to set the RNG state. However, multiple RPC threads from the same
process share the same RNG instance. Therefore, even though we
reset the RNG state before every torch.rand usage, a background RPC
thread could still mess up the draw order in the RNG, leading to
non-deterministic behavior. This commit addresses the problem by
avoiding the default RNG.

Test Plan: Imported from OSS

Differential Revision: D20657589

Pulled By: mrshenli

fbshipit-source-id: 0f45b11a902317f15f3ee8448bc240f5723075a5
2020-03-26 11:01:04 -07:00
3622e1c90f Revert D20589048: [pytorch][PR] [ROCm] Update CI dockers to ROCm release 3.1.1
Test Plan: revert-hammer

Differential Revision:
D20589048

Original commit changeset: 568f40c1b90f

fbshipit-source-id: 724c4fe99e8806f00d2f7dceb71d15a02358f663
2020-03-26 09:31:59 -07:00
ada40777c4 Rand function for complex dtype (#34924)
Summary:
Address https://github.com/pytorch/pytorch/issues/34380
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34924

Differential Revision: D20596623

Pulled By: anjali411

fbshipit-source-id: e17ce069cd763b773399128d113704579ca766e6
2020-03-26 08:34:56 -07:00
17a01c7c7b feature: deterministic random_split (#34043)
Summary:
## 🚀 Feature
Option to provide a seed (random_state) for random_split() like the sklearn API https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.

## Motivation
Useful for deterministic sampling & reproducible data generation (easily, without affecting the PRNG for other uses).
See https://github.com/pytorch/pytorch/issues/32467
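A minimal usage sketch, assuming the generator-keyword form that this feature landed as:

```
import torch
from torch.utils.data import TensorDataset, random_split

ds = TensorDataset(torch.arange(10))
g = torch.Generator().manual_seed(42)            # seeded, isolated PRNG
train, val = random_split(ds, [8, 2], generator=g)
print([int(x) for (x,) in train])                # same split on every run
```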
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34043

Differential Revision: D20605678

Pulled By: ezyang

fbshipit-source-id: 12b10bf72cd8a0d4264ae4d326064f806945d011
2020-03-26 08:02:39 -07:00
f7f7c4edd9 [ROCm] Update CI dockers to ROCm release 3.1.1 (#33930)
Summary:
Request to update ROCm CI dockers to release 3.1

Changes required to the PyTorch source base attached:
* switch to the fast path for the Caffe2 ReLU operator
* switch to the new hipMemcpyWithStream(stream) API to replace hipMemcpyAsync(stream) && hipStreamSynchronize(stream) paradigm in an optimized fashion
* disable two regressed unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33930

Differential Revision: D20589048

Pulled By: ezyang

fbshipit-source-id: 568f40c1b90f311eb2ba57f02a9901114d8364af
2020-03-26 07:55:44 -07:00
79054495d3 (Fixes #33934) Fix AttributeError for nn.Module's properties (#34324)
Summary:
As described in https://github.com/pytorch/pytorch/issues/33934, the current attribute error in `nn.Module`'s properties are wrong.

```python
from torch import nn

class MyModule(nn.Module):
    @property
    def something(self):
        hey = self.unknown_function()
        return hey

model = MyModule()
print(model.something)
```
This raises `AttributeError: 'MyModule' object has no attribute 'something'` when what we want is `AttributeError: MyModule instance has no attribute 'unknown_function'`.

This fixes this issue and will make properties much easier to debug !
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34324

Differential Revision: D20645563

Pulled By: ezyang

fbshipit-source-id: 130f861851bdbef43803569a5ce9e24d2b942179
2020-03-26 07:43:21 -07:00
299bd6d701 Update randomtemp on Windows (#35375)
Summary:
Introduce max retry times to the flaky CUDA build command.
Changes: https://github.com/peterjc123/randomtemp/compare/v0.2...v0.3
Targets https://github.com/pytorch/pytorch/issues/25393#issuecomment-603776413.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35375

Differential Revision: D20653082

Pulled By: ezyang

fbshipit-source-id: a609379af680ac15ad24c9e2e5b330ffba3d1149
2020-03-26 07:41:32 -07:00
4a4e385e13 Revert "Load torch_global_deps for Windows (#35177)" (#35355)
Summary:
This reverts commit d7a7bcb0428273fa54a836b52e750608ebe7e4de.

The previous commit is not useful because torch_global_deps doesn't include any external dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35355

Differential Revision: D20653036

Pulled By: ezyang

fbshipit-source-id: 6d2e2f90952ca865b27b649a6ff9114ada8ea78c
2020-03-26 07:33:48 -07:00
e0c227d376 Revert D20655246: [jit] add module interface tests to test_jit
Test Plan: revert-hammer

Differential Revision:
D20655246

Original commit changeset: 9e1f865b3f2d

fbshipit-source-id: 241f10738df714efb662f1c53551617dd1558b13
2020-03-26 06:53:19 -07:00
843fd740fb Revert D20645945: [pytorch][PR] [C++ API Parity] [Optimizers] Merged Optimizer and LossClosureOptimizer
Test Plan: revert-hammer

Differential Revision:
D20645945

Original commit changeset: 383588065bf1

fbshipit-source-id: 6d7bc5676de64e329d9862889f32033c76b4009c
2020-03-26 06:40:34 -07:00
de3b2f98db [Shape Inference] Add ssaRewrite pybind func (#35410)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35410

Reviewed By: yinghai

Differential Revision: D20653042

fbshipit-source-id: 3845413d4e80b9be4fb97dc1eb8e824a55fb7576
2020-03-26 00:46:28 -07:00
d807292c4a [ROCm] Hotfix disable tests (#35396)
Summary:
Regressions were introduced sometime in the last few days; disable the affected tests for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35396

Differential Revision: D20656744

Pulled By: xw285cornell

fbshipit-source-id: 386e4e5d50fb81a1d44e8f3558b81cb69299fe92
2020-03-26 00:21:40 -07:00
be0cdf5d15 [jit] Implement torch::jit::deregisterOperator (#35107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35107

Test Plan: CI + `bin/test_jit --gtest_filter=JitTest.CustomOperators --gtest_repeat=2`

Differential Revision: D20650516

Pulled By: malfet

fbshipit-source-id: 4a7d939498588c812319e7c1f432d54e6edf2189
2020-03-25 22:51:30 -07:00
a7c232f74c Port mm cuda from TH to ATen (#34891)
Summary:
Issue https://github.com/pytorch/pytorch/issues/24596

This PR moves `mm` cuda to ATen. The internal `addmmImpl` that was used as the base of the old TH version of `mm` cuda is also ported.

This PR also sets up `addmm` cuda to be fairly easily ported to ATen in a future PR, since TH `mm` and `addmm` used the same `addmmImpl` function at their core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34891

Differential Revision: D20650713

Pulled By: ngimel

fbshipit-source-id: 692aba1bbae65a18d23855b5e101446082d64c66
2020-03-25 21:42:35 -07:00
fa4603ef36 Also sync submodule in the Dockerfile (#35423)
Summary:
Sometimes a submodule URL may have changed between commits. Let the Dockerfile
also sync submodules before updating them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35423

Differential Revision: D20658464

Pulled By: ngimel

fbshipit-source-id: 9c101338437f9e86432d3502766858fa5156a800
2020-03-25 21:19:33 -07:00
0ccceb2290 [dist autograd] profile the amount of time spent executing (#35261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35261

Uses the RECORD_FUNCTION macro to profile the amount of time in
dist_autograd and ensure that it shows up in the profiler output. Since
dist_autograd.backward() is blocking, we can avoid stuffing the RecordFunction
into a callback. This does not support profiling the RPCs that are created when
gradients are forwarded over to other nodes; this can be added in a follow up
diff.
ghstack-source-id: 100723408

Test Plan: Added a UT.

Differential Revision: D20611653

fbshipit-source-id: f9718cf488398a1c7b63ac3841bd2f4549082c8a
2020-03-25 20:57:53 -07:00
7fbb562369 Back out "[reland] Skip OpenMP thread when OMP_NUM_THREADS is set to 1"
Summary:
Original commit changeset: 0d5a6537aa2f
build-break
overriding_review_checks_triggers_an_audit_and_retroactive_review

Test Plan: I will watch WhereIsStable

Differential Revision:
D20662253
Ninja: master broken

fbshipit-source-id: 96eb398dd8f4060f85e76fdfdff6aeb2befccc57
2020-03-25 20:42:11 -07:00
aa01a95c6d Revert D20630760: [pytorch][PR] Enable NNC tests vol. i. add test_tensorexpr.py tests [WIP]
Test Plan: revert-hammer

Differential Revision:
D20630760

Original commit changeset: 7d2f27aca6b1

fbshipit-source-id: 28ac92b3390651a4a67061d6ebf208515b9b9463
2020-03-25 20:34:46 -07:00
2dd867f30f Move normal() to DistributionTemplates (#35167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35167

The purpose of this PR is to move `normal`/`normal_`/`normal_out` to `native/DistributionTemplates.h`, `native/cpu/DistributionTemplates.h` and `native/cuda/DistributionTemplates.h` to make it reusable for custom RNG, see cpu_rng_test.cpp as an example of custom RNG.

Test Plan: Imported from OSS

Differential Revision: D20588248

Pulled By: pbelevich

fbshipit-source-id: 7ee60be97f81522cd68894ff1389007c05130a60
2020-03-25 19:54:18 -07:00
dc2c4d02f9 Add a wrapper to wrap all optimization for mobile. (#35227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35227

This wraps:
1. Conv-BN folding (not mobile specific)
2. Insertion of XNNPACK conv2d/linear ops
3. Removal of prepacking ops

Test Plan: Imported from OSS

Differential Revision: D20603562

fbshipit-source-id: ff373af7112c070ec6198bac51845282e09ff1f8
2020-03-25 19:21:14 -07:00
315929f43e Refactor code to move const prop to convolution 2d replacer. (#35226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35226

..

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20603561

fbshipit-source-id: 5905da9e6b91071c2bc28c323f58936746364b70
2020-03-25 19:19:44 -07:00
b4b8b3c0ca Revert D20630988: [quant][graph] Add a new observer type for dynamic quantization
Test Plan: revert-hammer

Differential Revision:
D20630988

Original commit changeset: 7e7aca77590f

fbshipit-source-id: 6bc67ca322c1703004e0053f8eba9b8f6a3a5f67
2020-03-25 18:52:21 -07:00
d7de6ad23f Revert D20640487: [quant][graph] Update dynamic quant tests to use new qconfig
Test Plan: revert-hammer

Differential Revision:
D20640487

Original commit changeset: e0b5cd36fc7a

fbshipit-source-id: 5a3265c7d90cbe848fc53c07365540c54610f481
2020-03-25 18:47:09 -07:00
0a3864f81e Throw an actionable error message on user call rref<ScriptModule>.to_here() in torchscript (#35369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35369

For issue, https://github.com/pytorch/pytorch/issues/35367

Test Plan:
```
buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_remote_script_module

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_torchscript_functions_not_supported
```

Differential Revision: D7870906

fbshipit-source-id: 2e78f2e620a5cc7c8f26ab35400ba33bb303788d
2020-03-25 18:40:05 -07:00
efbd6b8533 [C++ API Parity] [Optimizers] Merged Optimizer and LossClosureOptimizer (#34957)
Summary:
1. Removed LossClosureOptimizer, and merged Optimizer into OptimizerBase (and renamed the merged class to Optimizer)
2. Merged the LBFGS-specific serialize test function and the generic test_serialize_optimizer function.
3. BC-compatibility serialization test for LBFGS
4. Removed mentions of parameters_ in optimizer.cpp, de-virtualized all functions
5. Made defaults_ an optional argument in all optimizers except SGD

**TODO**: add BC-breaking notes for this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34957

Differential Revision: D20645945

Pulled By: yf225

fbshipit-source-id: 383588065bf1859b38f0ad0a25d93d41e153c96e
2020-03-25 18:26:02 -07:00
e08614ffd5 [Autograd Testing] Test failure in parent graph before child reentrant task (#35268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35268

Context: https://github.com/pytorch/pytorch/issues/34367

Test: Mixed with Errors -> Reentrant on different devices -> Make parent error
before child finishes
ghstack-source-id: 100864838

Test Plan: waitforbuildbot

Differential Revision: D20612919

fbshipit-source-id: 7bfa194820c9b91711590d3719356bc90b5937ef
2020-03-25 17:53:45 -07:00
f3a5081bd4 Enable NNC tests vol. i. add test_tensorexpr.py tests [WIP] (#34897)
Summary:
This  PR add tensorexpr cpp tests to test_jit.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34897

Differential Revision: D20630760

Pulled By: Krovatkin

fbshipit-source-id: 7d2f27aca6b1e23e3ffed1c765d8f590688118e3
2020-03-25 17:23:48 -07:00
ccc0e35275 [quant][graphmode] quantization support for prim::CallFunction (#34855)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34855

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20655305

fbshipit-source-id: 44cc3525967048fb9d9c145b342ac7d76b22e4db
2020-03-25 17:17:19 -07:00
64a6faa2c8 [quant][graph] Update dynamic quant tests to use new qconfig (#35325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35325

default_dynamic_qconfig now holds activation observer

Test Plan:
python test/test_quantize_script.py

Imported from OSS

Differential Revision: D20640487

fbshipit-source-id: e0b5cd36fc7a9c0dcc9020e12901b46008b3ff40
2020-03-25 16:52:17 -07:00
7e24ab8c4a [quant][graph] Add a new observer type for dynamic quantization (#35265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35265

In graph mode we need to observe the activation tensor for dynamic quantization. This observer should behave the same way as the quantization functions called in the dynamic operator.
Currently for qlinear_dynamic we call quant_utils::ChooseQuantizationParams which has its own logic for calculating scale and zero_point.
We mimic those calculations in the new observer.

Test Plan:
python test/test_quantization.py ObserverTest

Imported from OSS

Differential Revision: D20630988

fbshipit-source-id: 7e7aca77590f965dcb423a705e68d030aaf98550
2020-03-25 16:50:05 -07:00
7580470cc5 Update view op list. (#35399)
Summary:
Adding ops to the list based on our discussion. :D
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35399

Differential Revision: D20651393

Pulled By: ailzhang

fbshipit-source-id: 8cf9026d10c0d74117953dbb68ebc2f537be956a
2020-03-25 16:15:00 -07:00
6bcf0b407b [TensorExpr] Disable fuser-te cuda tests when run on ROCm. (#35388)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35388

Test Plan: Imported from OSS

Differential Revision: D20648735

Pulled By: ZolotukhinM

fbshipit-source-id: 27bd776fbb84ec81034ace4b874522413d9e5643
2020-03-25 16:04:15 -07:00
d7c255d2fc [jit] add module interface tests to test_jit (#35417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35417

Surprised it's not getting run by test_jit; added it.

Test Plan: Imported from OSS

Differential Revision: D20655246

Pulled By: wanchaol

fbshipit-source-id: 9e1f865b3f2d23b63d4d605aaf2dc3a483a4f0e1
2020-03-25 15:25:28 -07:00
00aa23446b [JIT] [Reland] add complexity tests (#35330)
Summary:
Relanding https://github.com/pytorch/pytorch/pull/34918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35330

Differential Revision: D20633804

Pulled By: eellison

fbshipit-source-id: ce5cf45f53a25830141bedb759ff712a59a534c7
2020-03-25 14:22:52 -07:00
a4ea16dbc6 Put prim ops used in full jit only in a separate file (#35232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35232

Some prim operators, like profile and fusion, are not used on mobile (at least in the short term), and they are coupled with JIT code. Put them in a separate file (register_prim_ops_fulljit.cpp).
ghstack-source-id: 100807055

Test Plan: buck build //xplat/caffe2:torch

Reviewed By: dreiss

Differential Revision: D20408827

fbshipit-source-id: 9013093357cf75723ef00c34bbfdb6b7ea40a4cf
2020-03-25 14:15:34 -07:00
17abb7c31a Add docs to resize_ and resize_as_ (#35392)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35392

Test Plan: Imported from OSS

Differential Revision: D20650097

Pulled By: VitalyFedyunin

fbshipit-source-id: cff4f555d355dfee42394f6070fe3e466949aeb5
2020-03-25 14:09:25 -07:00
512bcf68be [Formatting] if ( -> if( in CMakeLists.txt (#35343)
Summary:
Same to `else`, `endif` and `elseif`.
Also prefer lowercase over uppercase ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35343

Test Plan: None at all

Differential Revision: D20638789

Pulled By: malfet

fbshipit-source-id: 8058075693185e66f5dda7b825b725e139d0d000
2020-03-25 13:48:42 -07:00
c9117f27c4 Fix final callbacks for reentrant backwards (#35066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35066

Closes #24965

Prior to this commit, final_callbacks_ were cleared on exit of ANY
backward. When using reentrant backward, the last backward would
remove all callbacks from the engine. However, this might lead to
unexpected behavior. For example, the application could install
a final callback after the forward pass, expecting this callback to fire
when all gradients are ready. If there is a reentrant backward on
a subgraph, it would fire the callback and delete it on exit,
meaning that when fired, not all gradients are ready.
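A minimal sketch of the scenario, using the private engine hook that backs this feature (hedged, since the hook is internal): a callback queued during one backward call should fire when that call's GraphTask completes, instead of being cleared by a nested backward.

```
import torch

x = torch.randn(3, requires_grad=True)

def grad_hook(grad):
    # queued from inside the backward pass, so it attaches to this GraphTask
    torch.autograd.Variable._execution_engine.queue_callback(
        lambda: print("fired once this backward's gradients are ready"))
    return grad

x.register_hook(grad_hook)
(x * x).sum().backward()
```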

**Failed Attempt**

The 1st attempt was trying to move the callback to the GraphTask
in engine::execute(). However, this failed because more callbacks
could be installed during backward pass.

**Current Solution**

Final callbacks are stored as a member variable in the GraphTask.

* Insertion: use the thread_local current_graph_task to find the
target GraphTask, and append final callback.
* Deletion: final callbacks have the same lifetime as a GraphTask
* Execution: Use the GraphTask provided in the argument to find
final callbacks.

Test Plan: Imported from OSS

Differential Revision: D20546474

Pulled By: mrshenli

fbshipit-source-id: d3f3449bb5af9f8703bcae63e6b52056cd535f11
2020-03-25 13:47:06 -07:00
aadd0fda8b Document reduce_scatter collective operation (#35274)
Summary:
I don't know why the reduce_scatter collective operation is not documented, so I add it to the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35274

Differential Revision: D20645850

Pulled By: mrshenli

fbshipit-source-id: 0a4458bff1a4e15a4593dd4dcc25e4e0f6e2265d
2020-03-25 13:36:29 -07:00
40b244ceb4 Fix handling of non-finite values in topk (#35253)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34191

`at::native::radixSelect` basically uses integer comparison which creates a defined ordering of non-finite float values. This isn't compatible with IEEE float comparison, so mixing the two leads to unwritten values in the output.
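A minimal sketch of the input class this fixes (illustrative only; the bug was specific to the CUDA path):

```
import torch

x = torch.tensor([1.0, float('nan'), float('inf'), -float('inf'), 0.5])
if torch.cuda.is_available():
    x = x.cuda()
values, indices = x.topk(3)
print(values, indices)  # every output slot should now be written
```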
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35253

Differential Revision: D20645554

Pulled By: ezyang

fbshipit-source-id: 651bcb1742ed67086ec89cc318d862caae65b981
2020-03-25 13:29:45 -07:00
de3044b210 Load all DLLs in the lib directory for Windows (#35362)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35358.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35362

Differential Revision: D20645218

Pulled By: ezyang

fbshipit-source-id: 08ef5889fe2cd9139a3f6852ee73fe7742b315b5
2020-03-25 13:22:45 -07:00
34b005954e Support merge_fp32_inputs_into_fp16 for predefined partitions (#35361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35361

If the inputs we are bundling together will be consumed by ops from the same partition, we can assign the Split and Half2Float ops to that partition too. Otherwise, we do nothing.

Reviewed By: bangshengtang

Differential Revision: D20639777

fbshipit-source-id: 4032abb9178f3b44a85e4789ddf5ad5624245e3a
2020-03-25 12:53:48 -07:00
d863fe356d Ignore rest of outputs of LayerNorm when lowering to Glow (#35338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35338

Pull Request resolved: https://github.com/pytorch/glow/pull/4343

- Ignore rest of outputs of layoutNorm
- Support implicit broadcast when loading basic binary ops from Glow.

Test Plan:
```
buck test glow/fb/test:test_onnxifinnpi -- test_layernorm_mul --print-passing-details
```

Reviewed By: jfix71

Differential Revision: D20627768

fbshipit-source-id: 30ae8a2f590452f0b354d413ae2c5ec46a4a77d8
2020-03-25 12:52:25 -07:00
15e5453977 [reland][quant][graphmode] Add quantization support for aten::cat (#34346) (#35337)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35337

Test Plan: python test/test_jit.py

Differential Revision: D20648201

Pulled By: jerryzh168

fbshipit-source-id: f6570c3ee2f48a9bc6373d2af873824ac2c8ef62
2020-03-25 12:45:21 -07:00
51818cc4ea [TensorExpr] Cleanup implementation of alloc/free insertion. (#35176)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35176

Differential Revision: D20585574

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 1fd4330e8d7c1c2bceecc2ba927bcc455f5d858f
2020-03-25 11:51:21 -07:00
db0f715af6 [TensorExpr] Factor out LoopNest::insertAllocFree. (#35175)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35175

Differential Revision: D20585576

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 498b7ddf44df11392f6b5454387a29c5457bdb05
2020-03-25 11:51:16 -07:00
ceb4ed3733 [TensorExpr] Methods name cleanup in LoopNest class. (#35174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35174

Differential Revision: D20585575

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 0fa8e1e85e1502b9a86cf34608cb791ffb23d395
2020-03-25 11:51:11 -07:00
450738662b [TensorExpr] Replace ExprHandle with const Expr* in Substitute. (#35173)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35173

Differential Revision: D20585577

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 902f9740a0b97c3d2a0eef2c274d8227b975b3cb
2020-03-25 11:48:14 -07:00
5959bd6c29 Making sure all tensors in torch.cat sequence have the same dtype. (#35150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35150

Fixes #35014

Test Plan: Imported from OSS

Differential Revision: D20578589

Pulled By: z-a-f

fbshipit-source-id: edeaef133d1cf5152dcbafab2b969f1424ee2836
2020-03-25 11:36:12 -07:00
7cb301e48d [rpc][easy] remove code duplication on ProcessGroupAgent::enqueueSend (#35311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35311

This must have snuck in since a couple of PRs updated this same area and
the merge conflict was not resolved properly.
ghstack-source-id: 100770387

Test Plan: CI

Differential Revision: D20602683

fbshipit-source-id: 22134069194b4095dd3be920e4e7f4437dac06f0
2020-03-25 11:28:12 -07:00
7e327e1210 Enable Constant Folding for ONNX Opset 12 (#34823)
Summary:
Currently constant folding is only enabled for ONNX opset versions 9 to 11. This PR enables it for the new ONNX opset 12.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34823

Reviewed By: hl475

Differential Revision: D20627629

Pulled By: houseroad

fbshipit-source-id: 7501d8ab8295751c0e9a02752d8908a35d8a0454
2020-03-25 11:06:39 -07:00
032c27cff7 [quant][graph] Add _choose_qparams function for graph mode (#35235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35235

For dynamic quantization in graph mode, we need an operator that returns the qparams of the tensor
similar to the linear_dynamic quantized op

Test Plan:
python test/test_quantized_tensor.py TestQuantizedTensor.test_choose_qparams

Imported from OSS

Differential Revision: D20608793

fbshipit-source-id: b923b2620421b32d05f4097db0d6153d53198221
2020-03-25 10:33:21 -07:00
f9889aa390 [reland] Skip OpenMP thread when OMP_NUM_THREADS is set to 1 (#35353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35324

When OMP_NUM_THREADS is set to 1, we don't need to launch the parallel_for function on an OpenMP thread, since there is no intra-op parallelism. By avoiding that, we can reduce unnecessary context switches.

Test Plan: internal

Reviewed By: ilia-cher

Differential Revision: D20638734

fbshipit-source-id: 0d5a6537aa2fc35d8d0904c3b9e734e52585eee7
2020-03-25 10:06:57 -07:00
3645d9b832 Port diag cpu from TH to ATen (#35100)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/24689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35100

Differential Revision: D20624868

Pulled By: VitalyFedyunin

fbshipit-source-id: bc436a62369aa9b6257e82051eabf5768652cf58
2020-03-25 09:42:53 -07:00
53fceff1e1 Change weight scale test to cpu only (#35346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35346

The weight scale op doesn't have a GPU implementation. This was breaking OSS CI after D20506032, so make the test CPU-only.

Test Plan: OSS CI

Reviewed By: ustctf

Differential Revision: D20637440

fbshipit-source-id: 9aa6cce63ce637ab7856788e5d02f527decb2a26
2020-03-25 09:18:58 -07:00
c73e97033a Added type promotion logic for complex numbers (#34093)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/33780
After this PR:
1. dtype promotion logic will correctly work for ops involving complex scalars
2. added alias for complex64 (cfloat) and complex128 (cdouble)
3. added an internal function get_complex_default_dtype (consciously not exposed in public API)
   - sets the default complex dtype to be double if default_dtype is set to double, else float https://github.com/pytorch/pytorch/pull/34093#discussion_r392350224
>>> 1j*torch.ones(2)
tensor([(0.0000 + 1.0000j), (0.0000 + 1.0000j)], dtype=torch.complex64)

>>> torch.set_default_dtype(torch.float64)
>>> 1j*torch.ones(2)
tensor([(0.0000 + 1.0000j), (0.0000 + 1.0000j)], dtype=torch.complex128)

>>> 1j + torch.ones(2)
tensor([(1.0000 + 1.0000j), (1.0000 + 1.0000j)], dtype=torch.complex128)

>>> torch.tensor(1j) + torch.ones(2,2)
tensor([[(1.0000 + 1.0000j), (1.0000 + 1.0000j)],
        [(1.0000 + 1.0000j), (1.0000 + 1.0000j)]], dtype=torch.complex128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34093

Differential Revision: D20537125

Pulled By: anjali411

fbshipit-source-id: 05fb1f81b8ba039d0b698cdd2c0bbf8b0ce0b767
2020-03-25 09:12:21 -07:00
361eed6a6e Use JIT op registration directly for lite interpreter. (#34070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34070

The first step toward making all operators available to the lite interpreter. The original code used manual registration of lite interpreter ops with a "_" prefix, for two reasons:
1. To minimize the build size.
2. To avoid duplicate registration in OSS (mainly feature testing and unit tests).

Now that we have more and more models to support, manual registration is no longer practical. To make this process automatic while keeping the binary size under control, we plan to:
1. Make all necessary ops callable from the lite interpreter.
2. The binary size will increase because of step 1; use ljk53's custom build to selectively build the binary with the ops used in specific models. The ops will be automatically collected using get_opnames.
3. The temporary "register_mobile_ops.cpp" can then be removed.

Test Plan: Imported from OSS

Differential Revision: D20291596

Pulled By: iseeyuan

fbshipit-source-id: 553b4699619cd71fea20658f3bc8c2d48852ef5c
2020-03-25 07:21:51 -07:00
3789db40f2 [aibench] added support for measuring memory on AI Bench for Caffe2 Models (#35036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35036

Exposing the helper functions in benchmark_helper.h

Reviewed By: kimishpatel, geof90

Differential Revision: D20528983

fbshipit-source-id: 73231becd93b1e700d37af425bebb628890dec9a
2020-03-25 01:58:18 -07:00
c2804e8229 Fix Caffe2 mobile compilation (#35288)
Summary:
fixes https://github.com/pytorch/pytorch/issues/35211
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35288

Reviewed By: jianyuh

Differential Revision: D20639692

Pulled By: jspark1105

fbshipit-source-id: 83e12c1c956271c10ffba197206bd8d5a158e700
2020-03-25 01:43:00 -07:00
d6149a7250 move some ops to contrib (#35282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35282

move some ops to contrib: sls, batchnorm

Test Plan: https://our.intern.facebook.com/intern/testinfra/testrun/7599824379495916

Reviewed By: yinghai

Differential Revision: D20616678

fbshipit-source-id: 6dd8733f89f3932b4fdb127630dce132b9a60ebc
2020-03-25 00:13:50 -07:00
d6377b7cef Fix thread_local initializtion in C10 WarningHandler. (#34822)
Summary:
The Windows + MSVC-specific bug discussed here: https://github.com/pytorch/pytorch/issues/19394 and fixed here: https://github.com/pytorch/pytorch/issues/22405 still appears in C10's warning handler class. This results in a crash if a user attempts to run code which would print a warning when that code is running inside a thread created by a DLL. This PR applies a similar fix to that of https://github.com/pytorch/pytorch/issues/22405.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34822

Test Plan:
* Tested locally by running CodecverseWorkbench Unity app with patched build.
* CI

Differential Revision: D20627971

Pulled By: HapeMask

fbshipit-source-id: 64dfca531ed7eebbe9e0ecac3d3d4d025c683883
2020-03-24 23:52:23 -07:00
f090031e69 [JIT] remove list appends (#33199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33199

Remove list appends when we can match them with a list construction. This helps create a larger functional graph

Test Plan: Imported from OSS

Differential Revision: D20603187

Pulled By: eellison

fbshipit-source-id: a60e933b457479d40960994d8ffdf39ef49eaf6e
2020-03-24 23:46:03 -07:00
aab4beb87f [JIT] Pass To Safely Remove Aten Inplace Ops (#33186)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33186

This helps create larger functional graphs. It has the potential to increase memory use, so in order to land this on by default we would probably also do a reuse of buffers pass.

This is currently O(n * |Removed Nodes|) because we have to rebuild the alias Db each time we make a change. This pass is critical to creating functional graphs, so this might be a compelling use case for building incremental updates to the alias Db.

Test Plan: Imported from OSS

Differential Revision: D20603189

Pulled By: eellison

fbshipit-source-id: 105db52bf38e02188ca6df6d36294466d3309a0a
2020-03-24 23:45:58 -07:00
5b2f8cef08 [JIT] Functional Graph Pass (#33020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33020

This is a pass to create functional blocks. The other PRs in the stack help avoid some of the limitations that are are often found in graphs. It's possible that this would work well with a graph that is frozen. Follow up work items that will help this pass:

- We don't currently have any capacity in alias analysis to tell whether a Value that came from the wildcard set "re-escapes" back into the wildcard set.
- More comments on the semantics of the graph and correctness conditions
- We could consider using a dynamic DAG if the perf of this is a limitation.
- Potentially make functional graphs functional blocks instead, so that we do not repeatedly copy constants and to make the IR easier to read.

Test Plan: Imported from OSS

Differential Revision: D20603188

Pulled By: eellison

fbshipit-source-id: 6822a6e65f4cc2676f8f6445fe8aa1cb858ebeeb
2020-03-24 23:44:18 -07:00
01a7d6adcb [caffe2] Fix typo in dataset_ops (#35356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35356

Fix a few typos in dataset_ops

(Note: this ignores all push blocking failures!)

Test Plan: .

Reviewed By: yinghai

Differential Revision: D20554176

fbshipit-source-id: 8565f4b34f5d304696adb1c06d4596921938de8f
2020-03-24 22:24:46 -07:00
93065ff767 [1] add missing header for C10_EXPORT_CAFFE2_OP_TO_C10 (#35245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35245

add missing header file for the C10_EXPORT_CAFFE2_OP_TO_C10_CPU macro

(Note: this ignores all push blocking failures!)

Test Plan: buck build -c caffe2.expose_op_to_c10=1 //xplat/caffe2:mask_rcnn_opsAndroid

Reviewed By: dreiss

Differential Revision: D20528761

fbshipit-source-id: 7cd186ba72964c2e193aca994f87a91a71c3c5d7
2020-03-24 22:16:03 -07:00
6c39e362fd Minor fix to quantized conv docstring (#35134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35134

Test Plan: Imported from OSS

Differential Revision: D20571996

Pulled By: z-a-f

fbshipit-source-id: f5bb12e5779bed24c4e0e2f9e2ce90c8d628bd30
2020-03-24 20:04:50 -07:00
74c02619de quantized Conv1d (#35093)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35093

Test Plan: Imported from OSS

Differential Revision: D20555070

Pulled By: z-a-f

fbshipit-source-id: 2a6446b82ea59ce318ca743d4b61008916ea6d5c
2020-03-24 19:22:03 -07:00
b8f509fd9b Revert D20630949: Skip OpenMP thread when OMP_NUM_THREADS is set to 1
Test Plan: revert-hammer

Differential Revision:
D20630949

Original commit changeset: 0b6f1ba5b535

fbshipit-source-id: a29c3e3a1b20441581009e16eaf4c893725be8d3
2020-03-24 18:25:08 -07:00
574be9f816 Skip OpenMP thread when OMP_NUM_THREADS is set to 1 (#35324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35324

When the OMP_NUM_THREADS is set to 1, we don't need to launch the parallel_for function on an OpenMP thread since there is no intra-op parallelism. By avoiding that, we can reduce the unnecessary context switches.
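
A minimal Python sketch of the control flow this change adds (the real code lives in ATen's C++ `parallel_for`; the helper below is illustrative only):

```python
import threading

def parallel_for(begin, end, fn, num_threads):
    # With a single thread there is no intra-op parallelism, so run the
    # body inline and skip the dispatch (and the context switch it costs).
    if num_threads == 1:
        fn(begin, end)
        return
    chunk = (end - begin + num_threads - 1) // num_threads
    threads = [threading.Thread(target=fn, args=(i, min(i + chunk, end)))
               for i in range(begin, end, chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

parallel_for(0, 8, lambda b, e: print(b, e), num_threads=1)  # runs inline
```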

Test Plan: internal

Reviewed By: ilia-cher

Differential Revision: D20630949

fbshipit-source-id: 0b6f1ba5b535dafedb16742145a70cc4bb4872a2
2020-03-24 17:46:19 -07:00
a7f8655314 Revert D20624571: [pytorch][PR] [TensorExpr] Extend arithmetic simplifier to work with multi variable expressions
Test Plan: revert-hammer

Differential Revision:
D20624571

Original commit changeset: e49049377bee

fbshipit-source-id: 7d8dda0c3b44be1c3236a0313bbfa128b7015de7
2020-03-24 16:59:51 -07:00
ee7cd84fac Revert D20589145: [quant][graphmode] Add quantization support for aten::cat
Test Plan: revert-hammer

Differential Revision:
D20589145

Original commit changeset: c9159fffa88c

fbshipit-source-id: c6b8db13ed1ed19f4437b2fa70d88ce139d445e1
2020-03-24 16:24:22 -07:00
f1efe51028 add quantized version of hardswish operator (#34820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34820

Adds quantized version of hardswish, for common quantized operator coverage.

Note:
* we carry over scale and zero_point from the input to the output, because the
  range of the output is unbounded if x > 0 (see the sketch after this list)
* we also skip the .out variant so the user cannot specify a custom
  scale + zero_point (flexible on this).
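
A rough reference sketch of the scale/zero_point carry-over described above, emulating the op by dequantizing, applying float hardswish, and requantizing with the input's quantization parameters (illustrative helper, not the actual kernel):

```python
import torch

def ref_quantized_hardswish(qx):
    x = qx.dequantize()
    y = x * torch.clamp(x + 3.0, min=0.0, max=6.0) / 6.0  # float hardswish
    # Requantize with the *input's* scale and zero_point:
    return torch.quantize_per_tensor(y, qx.q_scale(), qx.q_zero_point(), qx.dtype)

qx = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=128,
                               dtype=torch.quint8)
print(ref_quantized_hardswish(qx))
```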

Test Plan:
```
python test/test_quantized.py

https://gist.github.com/vkuzo/f9b579315ed7f5fdb24839e3218d8465
```

Imported from OSS

Differential Revision: D20472905

fbshipit-source-id: 0f2a83e9f5f7b43485fa46caf30e756dc5d492a9
2020-03-24 15:16:58 -07:00
f3e9fa6122 add hardswish FP operator (#34747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34747

Adds the hardswish FP operator from MobileNetV3 to PyTorch. This is for
common operator coverage, since this is widely used.  A future PR will
add the quantized version.  CUDA is saved for a future PR as well.
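
For reference, a minimal Python sketch of the hardswish definition from the MobileNetV3 paper (the helper name is illustrative):

```python
import torch

def hardswish_ref(x):
    # hardswish(x) = x * relu6(x + 3) / 6
    return x * torch.clamp(x + 3.0, min=0.0, max=6.0) / 6.0

print(hardswish_ref(torch.linspace(-5.0, 5.0, 11)))
```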

Test Plan:
tests pass:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardswish_cpu_float32
```

microbenchmark:
https://gist.github.com/vkuzo/b10d3b238f24e58c585314e8b5385aca
(batch_size == 1: 11.5GiB/s, batch_size == 4: 11.9GiB/s)

Imported from OSS

Differential Revision: D20451404

fbshipit-source-id: c7e13c9ab1a83e27a1ba18182947c82c896efae2
2020-03-24 15:15:34 -07:00
b6306e1517 Revert D20624698: [pytorch][PR] Make GPU loops support mutable lambda
Test Plan: revert-hammer

Differential Revision:
D20624698

Original commit changeset: 06e398779345

fbshipit-source-id: d17059c692b4b460f3aa8081bc80c296ddb88228
2020-03-24 14:42:40 -07:00
4a84ac5f5d [jit] make Future type annotation available in Python (#27637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27637

Fixes https://github.com/pytorch/pytorch/issues/26578

Test Plan: Imported from OSS

Differential Revision: D20626866

fbshipit-source-id: 20d6a3a719fddcb33e0e17a56d7123535fa20d65
2020-03-24 14:36:05 -07:00
2623448746 Match case of package name to suppress warning (#35201)
Summary:
ee4673c1ae

`find_package(Torch)` is used most of the time: https://pytorch.org/cppdocs/installing.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35201

Differential Revision: D20627251

Pulled By: albanD

fbshipit-source-id: edc151ca437f1f0a778b9834db481bbc887cb9f5
2020-03-24 14:23:04 -07:00
fce67800f4 [TensorExpr] Extend arithmetic simplifier to work with multi variable expressions (#35127)
Summary:
A new version of the IR simplifier used by the jit/tensorexpr fuser. This is capable of simplifying expressions containing (shock) multiple variables, eg:

```(m * (1 * n_1) + (n  + 1)) - (m *  (1 * n_1) + n) => 1```

Similar to the previous IR simplifier, it uses a two-stage approach:
1. Traverse the tree, combining subtrees of commutable operations into a flat structure. In this implementation we have two intermediate Exprs: Term (expressing products of sub-expressions) and Polynomial (expressing sums of sub-expressions).
2. Traverse the tree, expanding Terms and Polynomials into their component operators.

Using the example above we execute with a process like this to simplify:
```
   (m * (1 * n_1) + (n  + 1)) - (m *  (1 * n_1) + n)
# Using PolynomialTransformer:
=> Sub(Add(Mul(m, Mul(1, n_1)), Add(n, 1)), Add(Mul(m, Mul(1, n_1)), n))
=> Sub(Polynomial(Term(m, n_1), n, 1), Polynomial(Term(m, n_1), n))
=> Polynomial(Term(m, n_1), Term(-1, m, n_1), n, -n, 1)
=> Polynomial(1)
# Using TermExpander
=> 1
```
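
A minimal Python sketch of the same two-stage idea (hypothetical helper names, not the actual C++ Term/Polynomial classes): stage 1 flattens a sum of products into a coefficient map keyed by each term's variables, merging like terms; stage 2 expands the map back into an expression.

```python
from collections import defaultdict

def combine(terms):
    # Stage 1: flatten into a "polynomial", merging like terms.
    # Each term is (coefficient, tuple of variable names).
    poly = defaultdict(int)
    for coeff, vars_ in terms:
        poly[tuple(sorted(vars_))] += coeff
    return {v: c for v, c in poly.items() if c != 0}  # drop cancelled terms

def expand(poly):
    # Stage 2: expand back into component operators (here, a string).
    parts = ["*".join((str(c),) + v) if v else str(c)
             for v, c in sorted(poly.items())]
    return " + ".join(parts) if parts else "0"

# (m*n_1 + n + 1) - (m*n_1 + n)  ==>  1
lhs = [(1, ("m", "n_1")), (1, ("n",)), (1, ())]
rhs = [(-1, ("m", "n_1")), (-1, ("n",))]
print(expand(combine(lhs + rhs)))  # prints: 1
```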

The IRSimplifier supports arithmetic simplifications of operators Add, Sub and Mul and constant folding of all binary Exprs and Intrinsics, but does not attempt expansion of multiplication of Polynomials to the canonical form since that generally leads to less efficient representations. It will do scalar factorization if it results in removal of operators, and will merge chains of multilane primitives (such as Broadcast and Ramp) down into a single operator. The ir_simplifier unit tests are a short tour of its capabilities.

The existing simplifier has a bug where it will sometimes reorder operations on floating point types which are not associative. This causes (at least) the pyhpc equation_of_state benchmark to produce incorrect results. I have fixed that issue in this version and verified that that benchmark produces the same results with and without the simplifier.

Tests: all cpp & py tensorexpr tests, and the pyhpc benchmark:
```
benchmarks.equation_of_state
============================
Running on CPU

size          backend     calls     mean      stdev     min       25%       median    75%       max   Δ
------------------------------------------------------------------------------------------------------------------
   4,194,304  pytorch           10     0.246     0.002     0.243     0.245     0.246     0.248     0.250     1.000
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35127

Differential Revision: D20624571

Pulled By: nickgg

fbshipit-source-id: e49049377beee69e02dcf26eb922bef1447ae776
2020-03-24 14:16:07 -07:00
2dc2933358 Move NewModuleTest and NewCriterionTest from test_nn.py to common_nn.py (#35189)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35189

Test Plan: Imported from OSS

Differential Revision: D20588197

Pulled By: yf225

fbshipit-source-id: 5a28159b653895678c250cbc0c1ddd51bc7a3123
2020-03-24 14:05:45 -07:00
6b5740c5f6 [quant][graphmode] Add quantization support for aten::cat (#34346)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34346

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20589145

fbshipit-source-id: c9159fffa88cf25fcdccfcc4eef2622cf4b250b5
2020-03-24 13:56:43 -07:00
b7e4dd15cc [jit] Remove stray @script (#34938)
Summary:
Stacked PRs
 * **#34938 - [jit] Remove stray `script`**
 * #34935 - [jit] Add lazy script decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34938

Pulled By: driazati

Differential Revision: D20569793

fbshipit-source-id: 1f126646f7bd7c4ea972e15023eaa60f0e301351
2020-03-24 13:44:38 -07:00
44622bbda9 [jit] Add lazy script decorator (#34935)
Summary:
Stacked PRs
 * #34938 - [jit] Remove stray `script`
 * **#34935 - [jit] Add lazy script decorator**

Some users maintain libraries of code that is largely traceable but not
scriptable. However, some functions may need to be `torch.jit.script`ed if
they contain control flow, so that the tracer will use the compiled version.
This, however, impacts library startup time as in #33418, so this PR adds
a workaround in the form of a `torch.jit._lazy_script_while_tracing`
decorator that only initializes the compiler if the function is called
while actually tracing.
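
A hypothetical usage sketch based on the description above (the decorator is internal/underscore-prefixed and may change; the function and shapes below are made up):

```python
import torch

@torch.jit._lazy_script_while_tracing
def clamp_negative_sum(x):
    # Data-dependent control flow the tracer cannot capture on its own:
    if bool(x.sum() < 0):
        return torch.zeros_like(x)
    return x

def f(x):
    return clamp_negative_sum(x) * 2

# The compiler is only initialized here, when the decorated function is
# actually called during tracing -- not at import time.
traced = torch.jit.trace(f, torch.randn(3))
```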

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34935

Pulled By: driazati

Differential Revision: D20569778

fbshipit-source-id: d87c88c02b1abc86b283729ab8db94285d7d4853
2020-03-24 13:43:18 -07:00
84dc8c410a Adds workaround for ScalarType::Byte for CUDA (#35027)
Summary:
This PR adds a `cuda` workaround for `ScalarType::Byte` in the `AT_DISPATCH_*` macros.
As discussed here:
https://github.com/pytorch/pytorch/issues/34826
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35027

Differential Revision: D20596555

Pulled By: colesbury

fbshipit-source-id: 72e842603723a5aa146e4224e79befafc62f2624
2020-03-24 12:51:57 -07:00
39a101d06e Make GPU loops support mutable lambda (#35015)
Summary:
I will need it for https://github.com/pytorch/pytorch/pull/34004

The `mutable` qualifier allows a lambda to capture some values and modify its own copy. This would be useful for random kernels: we capture an RNG `state`, initialize it on the first run, and use the initialized state later:

```C++
gpu_kernel(iter, [state, initialized](scalar_t arg) mutable -> scalar_t {
  if (!initialized) {
    curand_init(..., state);
    initialized = true;
  }
  return some_math(curand_uniform(state), arg);
});
```

The `operator()` of a `mutable` lambda is not `const`, so we cannot pass the lambda as a constant reference. It cannot be called inside a non-`mutable` lambda either.

Example usage:

```C++
auto t = at::empty({4096}, kCUDA);
float thread_work_index_ = 0;
auto iter = TensorIterator::nullary_op(t);
gpu_kernel(iter, [thread_work_index_]GPU_LAMBDA() mutable -> float {
  return thread_work_index_++;
});
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35015

Differential Revision: D20624698

Pulled By: ngimel

fbshipit-source-id: 06e3987793451cd514181d20252510297e2d28a9
2020-03-24 12:30:49 -07:00
edad9c102d Update XNNPACK to Git revision 1b354636b5942826547055252f3b359b54acff95. (#35081)
Summary:
Required to fix a build issue in https://github.com/pytorch/pytorch/issues/33766.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35081

Reviewed By: dreiss

Differential Revision: D20567230

Pulled By: AshkanAliabadi

fbshipit-source-id: 1ed61708851402f60b80abc818ae7330e43adb83
2020-03-24 12:24:30 -07:00
abcd4eb993 Optimize min and max(reduce_dim) performance on CPU (#34875)
Summary:
This PR improves min and max (reduce over a dim) performance on CPU.
Test script:
```
 import torch
 import torch.nn as nn
 import time

 torch.manual_seed(0)

 def _time():
     return time.time()

 device = "cpu"
 torch.set_num_threads(1)

 #warm up
 for n in [10, 200]:
     # contiguous
     # input = torch.randn(n, n, n, requires_grad=False, device=device)
     # discontiguous
     input = torch.randn(n, 2*n, n, requires_grad=False, device=device)[:, :n, :]
     for dim in range(input.dim()):
         for i in range(1000):
             output = input.min(dim)
             #output = input.max(dim)

 for n in [10, 200]:
     # contiguous input
     # input = torch.randn(n, n, n, requires_grad=False, device=device)
     # discontiguous
     input = torch.randn(n, 2*n, n, requires_grad=False, device=device)[:, :n, :]
     for dim in range(input.dim()):
         fwd_t = 0
         for i in range(10000):
             t1 = _time()
             output = input.min(dim)
             #output = input.max(dim)
             t2 = _time()
             fwd_t = fwd_t + (t2 -t1)
         fwd_avg = fwd_t / 10000 * 1000
         print("size = (%d, %d, %d); reduce dim=%d; compute time is %.4f(ms)" % (n, n, n, dim, fwd_avg))
```
Test device: **skx-8180**.

### Contiguous case.
- num_threads = 56

| size | dim=0 before (ms) | dim=1 before (ms) | dim=2 before (ms) | dim=0 after (ms) | dim=1 after (ms) | dim=2 after (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| n=10 | 0.0243 | 0.0243 | 0.0244 | 0.0063 | 0.0065 | 0.0063 |
| n=200 | 0.9615 | 0.9453 | 0.7772 | 0.2937 | 0.2675 | 0.2607 |

- num_threads = 1

| size | dim=0 before (ms) | dim=1 before (ms) | dim=2 before (ms) | dim=0 after (ms) | dim=1 after (ms) | dim=2 after (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| n=10 | 0.0126 | 0.0126 | 0.0114 | 0.0062 | 0.0065 | 0.0064 |
| n=200 | 32.1276 | 33.3489 | 29.0757 | 8.0556 | 7.0188 | 6.5014 |

### Discontiguous case.
- num_threads = 56

| size | dim=0 before (ms) | dim=1 before (ms) | dim=2 before (ms) | dim=0 after (ms) | dim=1 after (ms) | dim=2 after (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| n=10 | 0.0106 | 0.0115 | 0.0131 | 0.0063 | 0.0066 | 0.0065 |
| n=200 | 14.652 | 15.3496 | 9.8153 | 0.2946 | 0.2708 | 0.267 |

- num_threads = 1

| size | dim=0 before (ms) | dim=1 before (ms) | dim=2 before (ms) | dim=0 after (ms) | dim=1 after (ms) | dim=2 after (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| n=10 | 0.0108 | 0.0116 | 0.0132 | 0.0058 | 0.0062 | 0.0061 |
| n=200 | 12.5132 | 13.0785 | 9.6738 | 8.3733 | 7.3051 | 6.4566 |

https://github.com/pytorch/pytorch/issues/24671 and https://github.com/pytorch/pytorch/issues/24672 are also fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34875

Differential Revision: D20605596

Pulled By: ngimel

fbshipit-source-id: 08fd4dacd1db63309123d7ec5942a4b8a0071896
2020-03-24 12:12:27 -07:00
fb70893e78 remove cadd_avx2 dead code (#34883)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34883

Test Plan: Imported from OSS

Differential Revision: D20611526

Pulled By: ngimel

fbshipit-source-id: 78c80b7361119fc8d2b9f6b4f0c86b61723fe05d
2020-03-24 12:00:56 -07:00
3f896ef743 Try pinning pyyaml and setuptools on macOS to older versions (#35296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35296

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20624843

Pulled By: ezyang

fbshipit-source-id: 9028f1dd62d0c25e916eb4927fd8dd6acbd88886
2020-03-24 11:53:10 -07:00
0f0a5b11b8 Disable C4251 when compiling cpp_extensions on Windows (#35272)
Summary:
Otherwise, VC++ will emit a warning for every exposed C++ symbol, for example:
```
include\c10/core/impl/LocalDispatchKeySet.h(53): warning C4251: 'c10::impl::LocalDispatchKeySet::included_': class 'c10::DispatchKeySet' needs to have dll-interface to be used by clients of struct 'c10::impl::LocalDispatchKeySet'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35272

Test Plan: CI

Differential Revision: D20623005

Pulled By: malfet

fbshipit-source-id: b635b674159bb9654e4e1a1af4394c4f36fe35bd
2020-03-24 11:08:28 -07:00
1d52530855 simpler 'cpu_scatter_gather_base_kernel' (#34690)
Summary:
Simplifies `cpu_scatter_gather_base_kernel` to accept only binary operations, sparing them from redundant checks.
CC v0dro
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34690

Differential Revision: D20604814

Pulled By: ngimel

fbshipit-source-id: 5e22c2f39a8e2861dc763454c88796d1aa38d2eb
2020-03-24 11:00:59 -07:00
55019d357e [quant][graphmode] Add observers for dynamic quant (#35121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35121

For dynamic quantization we insert observers at the input to mimic the quantization of activations that happens in the operator.
The observer for the weight is inserted similarly to static quantization.

Test Plan:
python test/test_quantize_script.py

Sample output for single layer FC

.graph(%self : __torch__.___torch_mangle_4.M,
      %x.2 : Tensor):
  %_observer_1 : __torch__.torch.quantization.observer.MinMaxObserver = prim::GetAttr[name="_observer_1"](%self)
  %x.1 : Tensor = prim::CallMethod[name="forward"](%_observer_1, %x.2)
  %2 : __torch__.torch.nn.modules.linear.___torch_mangle_5.Linear = prim::GetAttr[name="fc"](%self)
  %3 : Tensor = prim::CallMethod[name="forward"](%2, %x.1) # test/test_quantize_script.py:19:23
  return (%3)

graph(%self : __torch__.torch.nn.modules.linear.___torch_mangle_5.Linear,
      %input.1 : Tensor):
 %2 : Function = prim::Constant[name="linear"]()
 %3 : Tensor = prim::GetAttr[name="weight"](%self)
 %_observer_0 : __torch__.torch.quantization.observer.MinMaxObserver = prim::GetAttr[name="_observer_0"](%self)
 %7 : Tensor = prim::CallMethod[name="forward"](%_observer_0, %3)
 %4 : Tensor = prim::GetAttr[name="bias"](%self)
 %5 : Tensor = prim::CallFunction(%2, %input.1, %7, %4) # /home/supriyar/miniconda3/envs/pytorch_py3/lib/python3.7/site-packages/torch/nn/modules/linear.py:87:15
 return (%5)

Imported from OSS

Differential Revision: D20599144

fbshipit-source-id: 9a8fa0e8655b9908826b981dce8a11d86efce5df
2020-03-24 10:54:16 -07:00
a045343402 [Autograd Testing] Add a test where child reentrant task fails. (#35223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35223

Adding tests as part of
https://github.com/pytorch/pytorch/issues/34367.

This test covers
"Mixed with errors" ->
"Reentrant on same device" ->
"Make child error before parent finishes"
ghstack-source-id: 100725947

Test Plan: waitforbuildbot

Differential Revision: D20603127

fbshipit-source-id: 08484b0a98053491459e076bdd23caf042c47150
2020-03-24 10:42:37 -07:00
36e3c005f0 Add Python exception handling catch block to resolve deadlock (#35283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35283

https://github.com/pytorch/pytorch/issues/34260

Deadlock on destructing py::error_already_set.

There are request callback impls in Python, where Python exceptions could be thrown. To release Python exception py::objects, the GIL must be held.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_torchscript_functions_not_supported
```

Differential Revision: D7753253

fbshipit-source-id: 4bfaaaf027e4254f5e3fedaca80228c8b4282e39
2020-03-24 10:19:57 -07:00
925cdd57dc Replace all uses of AT_INDEX_ERROR with TORCH_CHECK_INDEX (#35050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35050

Differential Revision: D20550978

Pulled By: ezyang

fbshipit-source-id: df7c0730f27d2986601b1dee17e41957be2956de
2020-03-24 09:10:51 -07:00
0f0271e255 [RELAND2] Eager autocasting, out-of-place ops only (with MSVC 2017 fix) (#35102)
Summary:
This is the second reland attempt for https://github.com/pytorch/pytorch/pull/32140.

The first reland attempt https://github.com/pytorch/pytorch/pull/35011 failed due a [small incompatible change](https://github.com/pytorch/pytorch/pull/35011#issuecomment-601754216) in recent master (`skipIfRocm` was removed from `test_data_parallel.py`).

The present PR restores skipIfRocm.

Description from first reland attempt https://github.com/pytorch/pytorch/pull/35011:

> https://github.com/pytorch/pytorch/pull/32140 was approved and merged, but [reverted](d0577e19f0) because it broke builds with versions of Visual Studio older than 15.8 that were not represented in public CI.  The build failures were caused by a [known VS bug](https://developercommunity.visualstudio.com/content/problem/27729/allow-function-with-internal-linkage-as-template-n.html), fixed in versions 15.8 and newer.
>
> The present PR reverts the revert (restoring https://github.com/pytorch/pytorch/pull/32140 's diffs) and adds a workaround to enable compilation with VS < 15.8.  The workaround isn't pretty, but it's guarded by macros such that it's only used when compiling with VS < 15.8.  All other builds compile with the same code/control flow as was merged in https://github.com/pytorch/pytorch/pull/32140.
>
> Original description of https://github.com/pytorch/pytorch/pull/32140:
> > Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
> Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081
>
> > In-place ops and ops with user-supplied out=... can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/issues/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35102

Differential Revision: D20596918

Pulled By: ezyang

fbshipit-source-id: 60caa279bb0ce4a9bb0b28c1d585d42cf1cc7e50
2020-03-24 09:08:04 -07:00
50eb1a389f Add cpu_serial_kernel_vec (#34553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34553

This allows vectorized looping in a serial iteration over
TensorIterator.

Test Plan: Imported from OSS

Differential Revision: D20604238

Pulled By: ezyang

fbshipit-source-id: 61c451dac91d47cde7e1a937b271ab78c79e05d3
2020-03-24 08:57:50 -07:00
73a36a47a5 Gradcheck for complex (#35238)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35238

Differential Revision: D20607581

Pulled By: anjali411

fbshipit-source-id: 2caf78314a87461b255fd65c7f71c72e152b5161
2020-03-24 08:40:14 -07:00
6f6436ff5d Fix input overwriting in irfft (#35219)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35219

Differential Revision: D20605330

Pulled By: ezyang

fbshipit-source-id: a62f1685779bb05c3682255bb3a3f6f9ec35814f
2020-03-24 08:27:06 -07:00
d7a7bcb042 Load torch_global_deps for Windows (#35177)
Summary:
Fixes https://discuss.pytorch.org/t/torch-cat-runtimeerror-error-in-loadlibrarya/71188/8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35177

Differential Revision: D20604654

Pulled By: ezyang

fbshipit-source-id: 263eb401300812fd336ff820c53b543342dca95e
2020-03-24 08:20:45 -07:00
618c6214aa [reapply][JIT] Namespaces for TorchBind (#35254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35254

Reapply D20541090 with some BC fixes
ghstack-source-id: 100733987

Test Plan: buck test mode/dev-nosan //caffe2/torch/fb/predictor/model_repo/tests:ai_infra_representative_model_shard_6_test -- 'RepresentativeModelTest\/ShardedRepresentativeModelTest\.RunModel\/0'

Reviewed By: zdevito

Differential Revision: D20607111

fbshipit-source-id: 80f148d860571208c93e9308128cd480ff089f74
2020-03-24 00:39:48 -07:00
17068ba467 [JIT] BC shim for TorchBind classes (#35240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35240

This makes it so that if we have an old serialized TorchBind class, we don't try to load it in and instead rely on the ClassType that's in memory.
ghstack-source-id: 100703946

Test Plan: buck test mode/dev-nosan //caffe2/torch/fb/predictor/model_repo/tests:ai_infra_representative_model_shard_6_test -- 'RepresentativeModelTest\/ShardedRepresentativeModelTest\.RunModel\/0'

Reviewed By: zdevito

Differential Revision: D20605681

fbshipit-source-id: 5403f68937f822914c701d9c80573f0b4a93e83b
2020-03-24 00:38:02 -07:00
8b8af0d458 Revert D20539336: [JIT] add IR complexity tests
Test Plan: revert-hammer

Differential Revision:
D20539336

Original commit changeset: 14ac00a7b2b0

fbshipit-source-id: 1a51b461e88720599faf04dd3ca443d87f4de66d
2020-03-23 23:24:17 -07:00
7c1ea736ba Extends true_divide to be a method (#34794)
Summary:
Per title. See related https://github.com/pytorch/pytorch/pull/34570.

In PyTorch 1.7 the plan is for torch.div and Python's division operator to perform "true" division, like Python 3, JAX, and NumPy. To facilitate this change, this PR expands true_divide to be a method so it can cover all of torch.div's use cases.

New true_divide tests are added to test_torch.py, test_type_promotion.py, and test_sparse.py.
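
A quick illustration of the new method form (values are illustrative):

```python
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])

# true_divide always performs floating-point ("true") division, matching
# Python 3's `/`, even when both inputs are integer tensors:
print(a.true_divide(b))         # tensor([2.5000, 1.5000]) -- now a method
print(torch.true_divide(a, b))  # equivalent function form
```
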
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34794

Differential Revision: D20545507

Pulled By: mruberry

fbshipit-source-id: 55286f819716c8823d1930441a69008560ac2bd5
2020-03-23 23:12:23 -07:00
cd75d4e274 [quant][graphmode] Add prim::ListConstruct to general op handling (#34345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34345

prim::ListConstruct is similar to an op that doesn't require observation
we want to make sure we can propagate observed property through it

Test Plan:
this will be tested when we add support for cat
https://github.com/pytorch/pytorch/pull/34346

Imported from OSS

Differential Revision: D20524455

fbshipit-source-id: b5f8e0c8776d48d588aeba6735de06dcd308560e
2020-03-23 22:33:43 -07:00
537fdd77d5 [quant][graphmode] quantization support for view, transpose, contiguos (#34854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34854

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20524456

fbshipit-source-id: e6e8fc3db6cccbd32c210d04f921274d81996fe2
2020-03-23 22:33:39 -07:00
4a96911629 [quant][graphmode] quantization support for aten::chunk (#34806)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34806

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20524454

fbshipit-source-id: 92ac9bc251581e963258cb90dc3de73f8508c822
2020-03-23 22:33:34 -07:00
9c8f09d1a4 [quant][graphmode] quantization support for prim::ListUnpack (#34807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34807

Test Plan:
python test/test_jit.py
in https://github.com/pytorch/pytorch/pull/34806

Imported from OSS

Differential Revision: D20524452

fbshipit-source-id: 31956894d6be58b6ba96b02d338dd1fd802aeefc
2020-03-23 22:31:52 -07:00
c46c28a7cb Fix JitTest.ADFormulas intermittent failures (#35196)
Summary:
Clamp input tensor values to [-3, 3] to limit how small the `tanh` gradient can get
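
To see why clamping helps: the gradient of `tanh` is 1 - tanh(x)^2, which collapses toward zero outside roughly [-3, 3], making numeric gradient checks flaky. A quick check:

```python
import torch

x = torch.tensor([0.0, 3.0, 10.0], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)  # roughly [1.0, 9.9e-3, 8.2e-9] -- tiny beyond |x| > 3
```
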
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35196

Test Plan: CI + `bin/test_jit --gtest_filter=JitTest.ADFormulas --gtest_repeat=60000 --gtest_break_on_failure`

Differential Revision: D20611256

Pulled By: malfet

fbshipit-source-id: 8640faa5d8567d6c6df8cc5df80c2e65407116eb
2020-03-23 22:21:30 -07:00
9e7821ee82 [autograd] allow PyNode to persist error message (#34845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34845

This PR allows PyNode to persist the error message so that any pure C++
thread that runs autograd with a custom Python autograd function can
successfully capture the error message without maintaining an initial
PyThreadState.

Test Plan: Imported from OSS

Differential Revision: D20480685

Pulled By: wanchaol

fbshipit-source-id: 0488ea5a4df9a33b53ac5d0d59000c41ab6cb748
2020-03-23 21:54:28 -07:00
8346959f38 [caffe2] merge internal (RowWise)SparseAdagrad into open source version (#35090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35090

In preparation for open-sourcing fp16 + stochastic rounding SparseAdagrad and fused SparseAdagrad

Other minor changes:
* Removed template parameters T that are not actually used
* Removed unnecessary anonymous namespaces used in header files

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20552770

fbshipit-source-id: 224fdca15ea786620ce88e33cbcbf97661423538
2020-03-23 21:48:08 -07:00
ac4a0224f3 [quant][graphmode] Replicate quantize node for prim::If (#34804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34804

We want to replicate the quantize node for return values in blocks of prim::If
in order to create the quantization patterns.

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20524453

fbshipit-source-id: 2268ac555f646158f4e1ffc98ccc8101d7504194
2020-03-23 21:20:45 -07:00
340ccf56fb [PyTorch-RPC] In process_group_agent, avoid read-after-free (#35252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35252

The torch::from_blob() overload without a deleter is relatively
dangerous: it assumes that the caller will correctly persist
the tensor bits for as long as necessary.

We were at one point correctly persisting the send tensor bits in
process_group_agent, but with the early-return codepaths we are no
longer doing so.

This change switches to a more robust approach where we instead just use
the torch::from_blob-with-deleter syntax, and use std::move to avoid
a copy. There's an extra malloc, but that's effectively free compared with
the rest of the work involved here. And it means we don't have to worry
about the Tensor memory vanishing from underneath the send anymore.

The initial motivation here was dist_autograd_node_failure flakiness.

While the motivating case is handleSend(), we also fix handlePendingMessage().
ghstack-source-id: 100704883

Test Plan:
existing test coverage, e.g.
   buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest

Differential Revision: D20607028

fbshipit-source-id: cf9966c5aa9472830cfefaf7fc2f92af9b52630d
2020-03-23 20:57:09 -07:00
fddcd72a31 Add more fusion (conv3d and batchnorm) support in the PyTorch quantization flow (#33540)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33540

Differential Revision: D19994498

Pulled By: lly-zero-one

fbshipit-source-id: e5e13eab6924bd2ce1b57b16b672844b8b9638f5
2020-03-23 20:36:03 -07:00
bd0ef784e0 FAQ: Add note about recovering from OOM (#35214)
Summary:
Closes https://github.com/pytorch/pytorch/issues/18853

This documents the workaround needed to solve the issues in https://github.com/pytorch/pytorch/issues/18853
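
A hedged sketch of the recovery pattern (the retry policy and batch-halving below are illustrative, not the exact wording of the docs note):

```python
import torch

def forward_with_oom_retry(model, batch):
    try:
        return model(batch)
    except RuntimeError as err:
        if "out of memory" not in str(err):
            raise
        # Drop references to the failed iteration's tensors, release the
        # caching allocator's blocks, then retry with a smaller batch.
        torch.cuda.empty_cache()
        return model(batch[: max(1, batch.shape[0] // 2)])
```
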
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35214

Differential Revision: D20604877

Pulled By: ezyang

fbshipit-source-id: 71ed13cfa567d8e88fa9f18180a171cd174fb528
2020-03-23 20:22:46 -07:00
97ecfb4929 Updating submodules
Summary:
GitHub commits:

0bd7af23b3
4d2f1db963
d300d10962
9eafc02456
a2f301b725
b0395f0b0e
dcb403515c
74a3c0ae53
60b734a667

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 6fb6754598b16939b1b8acd7ea7d022dd7ee473c
2020-03-23 20:16:57 -07:00
ccf8dd6209 Print exitcode on failures in test_distributed.py (#35233)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35233

Test Plan: Imported from OSS

Differential Revision: D20605264

Pulled By: mrshenli

fbshipit-source-id: f3d814943c88c2e2fa5da7e642bbcf9a405d08e6
2020-03-23 18:30:17 -07:00
1b119861a8 [TensorExpr] Cleanup includes in loopnest.h, use forward declarations when possible. (#35129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35129

Test Plan: Imported from OSS

Differential Revision: D20569733

Pulled By: ZolotukhinM

fbshipit-source-id: c746c5e705ff79bd8c60c1ec94aa2319dfd669e1
2020-03-23 18:28:47 -07:00
b9fbec96e6 Support LIST_UNPACK and TUPLE_SLICE in lite interpreter. (#35241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35241

Test Plan: Imported from OSS

Differential Revision: D20609439

Pulled By: iseeyuan

fbshipit-source-id: 4f352b8641c203aaf9f2204e4080876bd4d47b0c
2020-03-23 18:21:26 -07:00
eff68bc872 [quant][graphmode] quantization support for aten::add (#34572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34572

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20519607

fbshipit-source-id: c57e062cffc24a47a76b73b58aff7f9ef80183fa
2020-03-23 17:52:28 -07:00
b2dcedf71e .circleci: Ensure describe happens in pytorch repo (#35065)
Summary:
Found an issue where the git describe wasn't properly executed since the
binary_populate_env.sh script was being executed from a different
directory.

'git -C' forces the describe to run in the script's directory, which
should contain the correct git information.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35065

Differential Revision: D20603172

Pulled By: seemethere

fbshipit-source-id: b19112ce4cb2dc45fbb3f84dedc4f1d3f2259748
2020-03-23 17:45:09 -07:00
8bb7f1ad11 [rpc] various fixes for ProcessGroupAgent (#34943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34943

Follow up to address Jeremy's and Shen's comments on
https://github.com/pytorch/pytorch/pull/34413:
1) Continue trying even if one `agent->send()` fails when cleaning up dist
autograd ctx
2) Use RAII for lock in process group agent `handleSend`
3) Return bool instead of int in `ProcessGroupAgent::handleRecv` to determine
if the count should be incremented
4) Move recvCounts increment in timed out future processing to be within the
block that ensures the future already doesn't have an error.
ghstack-source-id: 100681746

Test Plan: CI

Differential Revision: D20506065

fbshipit-source-id: 14a2820b3ae7a65edd103f0b333c4bc21e821235
2020-03-23 17:43:32 -07:00
c321f02756 Follow up on freezing (#34786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34786

1) Rename 'HashIValue' to 'HashAliasedIValue'
2) Added Object case in getSubValues function
3) Hashes tensors to their storage
4) Added Dict case in overrideGradient
5) nit clean up

Test Plan: Imported from OSS

Differential Revision: D20585270

Pulled By: bzinodev

fbshipit-source-id: f580f3cb80dd5623088a014efd5f0f5ccc1659c0
2020-03-23 17:25:40 -07:00
7ab25b2e6b [JIT] add id function (#34975)
Summary:
add an `id` function to give users a way of keeping a `seen` set of nn modules.
In practice, this is only used between values of `T` and `T` or `T` and `Optional[T]`, so in this implementation I made it so that None is the only value that can be zero. Python also only guarantees that `id()` gives semantically meaningful results for pointer types.

EDIT: now only allowing id on class types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34975

Reviewed By: driazati

Differential Revision: D20599564

Pulled By: eellison

fbshipit-source-id: 3c6666a9b9b0258198adc70969dd6332e3375e4f
2020-03-23 17:10:13 -07:00
131af4412e Add TORCH_CUDA_API to FilterDescriptor (#35131)
Summary:
`FilterDescriptor` is missing a `TORCH_CUDA_API`, so this symbol is not exported from `torch_cuda.so`, and users could have trouble building a cpp_extension that uses cuDNN.

cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35131

Differential Revision: D20604439

Pulled By: ezyang

fbshipit-source-id: c57414fc8a9df9cb1e910e2ec0a48cfdbe7d1779
2020-03-23 17:04:19 -07:00
6fa0b3df2e [testing] Pass verbosity settings to XMLTestRunner (#35224)
Summary:
When `unittest.main()` is invoked with a custom testRunner, verbosity settings for the runner must be set manually.
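
A sketch of the fix, assuming the `XMLTestRunner` from the third-party `unittest-xml-reporting` package:

```python
import sys
import unittest

import xmlrunner  # provided by unittest-xml-reporting (assumption)

if __name__ == "__main__":
    # unittest.main() does not forward its own verbosity to a custom
    # testRunner, so it must be set on the runner explicitly:
    verbosity = 2 if "--verbose" in sys.argv else 1
    unittest.main(testRunner=xmlrunner.XMLTestRunner(verbosity=verbosity))
```
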
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35224

Test Plan: CI

Differential Revision: D20605896

Pulled By: malfet

fbshipit-source-id: 79fc6f55911189b6d8a4bc83bd2390c94bd69e5e
2020-03-23 16:37:52 -07:00
bfdcc39301 in test_c10d.py, remove skip_if_rocm from tests that pass locally (#35124)
Summary:
iotamudelta: the test passed three iterations on CI with no flakiness detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35124

Differential Revision: D20604748

Pulled By: ezyang

fbshipit-source-id: ed013ca27f38a3610108421932245b494fac28c0
2020-03-23 15:57:41 -07:00
40da01db6a Add docs about memory format (#34818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34818

Test Plan: Imported from OSS

Differential Revision: D20601336

Pulled By: VitalyFedyunin

fbshipit-source-id: d34ad226be950bf134c6b383a4810ea6aa75599e
2020-03-23 15:06:33 -07:00
93983c7d00 Add USE_TSAN option (#35197)
Summary:
Sometimes it is important to run code under the thread sanitizer (TSAN).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35197

Test Plan: CI

Differential Revision: D20605005

Pulled By: malfet

fbshipit-source-id: bcd1a5191b5f859e12b6df6737c980099b1edc36
2020-03-23 14:56:42 -07:00
a00e12e755 [quant][graphmode] weight/bias of linear/conv can be reused for multiple ops (#35221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35221

When weight is reused, we only need to insert one observer for the weight

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20602492

fbshipit-source-id: e003e6316f6615f3526f0d00fb7b722148b4749b
2020-03-23 14:21:59 -07:00
3cd3f0b3f1 Fix Tensor __radd__ type hint issue (#35231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35231

Fixes #35213

(Note: this ignores all push blocking failures!)

Test Plan: `mypy -c "import torch; ten = torch.tensor([1.0, 2.0, 3.0]); print(7 + ten)"` should not produce any warnings

Differential Revision: D20604924

Pulled By: pbelevich

fbshipit-source-id: 53a293a99b3f2ab6ca5516b31f3a92f67eb67a39
2020-03-23 14:13:30 -07:00
37e355622a Pass the missed "non_blocking" argument for to() (#35144)
Summary:
The following code
```python
a = torch.randn(42,)
b = a.cuda(non_blocking=True)
```
will be **blocked** in the current master, and will **not be blocked** in the PyTorch 1.4 release. This can be verified by profiling with `nvprof --print-api-trace python script.py`. It is causing a performance issue.

I isolated the problem, and jjsjann123 & ptrblck pointed out the fix. Thanks!

cc csarofeen ptrblck jjsjann123 VitalyFedyunin ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35144

Differential Revision: D20601163

Pulled By: ngimel

fbshipit-source-id: edd2b1dabd8e615c106188f30ddb3e763bde7471
2020-03-23 13:49:23 -07:00
521c424b39 Make discontiguous tensors also benefit from unrolling (#34708)
Summary:
This is based on https://github.com/pytorch/pytorch/pull/33720; I didn't use a stacked diff because it is not very convenient for cherry-picking. Please review after https://github.com/pytorch/pytorch/issues/33720 is merged.

Benchmark shows an up to ~10% improvement on half on RTX 2080Ti:
https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-unroll-with-discontig-input.ipynb

We now have a `TrivialOffsetCalculator`, and the unroll strategy takes input and output offset calculators as constructor arguments. In cases where we know the input is contiguous (for example, when the unroll strategy is used inside a vectorized kernel), the trivial offset calculator is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34708

Differential Revision: D20601566

Pulled By: ngimel

fbshipit-source-id: e20e38517efb31c8af5fc377538992a980ff4130
2020-03-23 13:41:09 -07:00
9441c7a944 [JIT] add IR complexity tests (#34918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34918

I'm going to set this up as a benchmarking test that runs internally in FB, but soliciting reviews externally first.

I think that benchmarking the complexity of our nn module & functional tests is useful because they are the building blocks of models, so they should be pretty representative of generic model complexity. This also separates out complexity benchmarking into tests that are easily debuggable given a regression, instead of a 50K-node resnet graph.

For each test, I profile the graph with consistent shapes and measure (a rough counting sketch follows this list):
- Number of If & loop statements
- Number of non-tensor nodes (no tensor among the outputs)
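
A rough Python counting sketch of the two metrics (the real test uses the profiled graph with consistent shapes; this walks a plain scripted graph):

```python
import torch

def ir_complexity(scripted):
    graph = scripted.graph
    # Control-flow nodes:
    ifs_loops = sum(n.kind() in ("prim::If", "prim::Loop")
                    for n in graph.nodes())
    # Nodes with no tensor among their outputs:
    tensor_type = torch._C.TensorType.get()
    non_tensor = sum(
        not any(o.type().isSubtypeOf(tensor_type) for o in n.outputs())
        for n in graph.nodes()
    )
    return ifs_loops, non_tensor

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    if bool(x.sum() > 0):
        return x * 2
    return x - 1

print(ir_complexity(f))
```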

This is just a starting point for testing IR complexity. Future plans could involve:
- adding resnet, or other models in the model repo
- benchmarking number of guards

Current output:
Functional tests:
```
('Name', 'Ifs/Loops', 'non-tensor ops')
('conv1d', 0, 0)
('conv2d', 0, 0)
('conv3d', 0, 0)
('conv_transpose1d', 0, 0)
('conv_transpose2d', 0, 0)
('conv_transpose3d', 0, 0)
('conv_tbc', 0, 0)
('avg_pool1d', 0, 0)
('avg_pool2d', 0, 0)
('avg_pool3d', 0, 0)
('fractional_max_pool2d', 0, 3)
('max_pool1d', 0, 0)
('max_pool1d', 0, 0)
('max_pool2d', 0, 0)
('max_pool2d', 0, 0)
('max_pool3d', 0, 0)
('max_unpool1d', 0, 12)
('max_unpool2d', 0, 22)
('max_unpool3d', 0, 33)
('lp_pool1d', 0, 0)
('lp_pool2d', 0, 0)
('adaptive_max_pool1d', 0, 0)
('adaptive_max_pool2d', 0, 6)
('adaptive_max_pool3d', 0, 9)
('adaptive_avg_pool1d', 0, 0)
('adaptive_avg_pool2d', 0, 6)
('adaptive_avg_pool3d', 0, 9)
('dropout', 0, 0)
('alpha_dropout', 0, 0)
('dropout2d', 0, 0)
('dropout3d', 0, 0)
('feature_alpha_dropout', 0, 0)
('threshold', 0, 0)
('threshold', 0, 0)
('relu', 0, 0)
('relu', 0, 0)
('glu', 0, 0)
('hardtanh', 0, 0)
('hardtanh', 0, 0)
('relu6', 0, 0)
('relu6', 0, 0)
('elu', 0, 0)
('elu', 0, 0)
('selu', 0, 0)
('selu', 0, 0)
('celu', 0, 0)
('celu', 0, 0)
('leaky_relu', 0, 0)
('leaky_relu', 0, 0)
('rrelu', 0, 0)
('rrelu', 0, 0)
('hardshrink', 0, 0)
('tanhshrink', 0, 0)
('softsign', 0, 0)
('softplus', 0, 0)
('softmin', 0, 0)
('softmax', 0, 0)
('softmax', 0, 0)
('tanh', 0, 1)
('sigmoid', 0, 1)
('log_softmax', 0, 0)
('linear', 0, 0)
('linear', 0, 0)
('bilinear', 0, 0)
('embedding', 0, 0)
('embedding_bag', 0, 0)
('batch_norm', 0, 0)
('instance_norm', 1, 6)
('layer_norm', 0, 0)
('layer_norm', 0, 0)
('layer_norm', 0, 0)
('layer_norm', 0, 0)
('group_norm', 3, 53)
('local_response_norm', 0, 0)
('nll_loss', 1, 5)
('poisson_nll_loss', 0, 0)
('poisson_nll_loss', 0, 0)
('kl_div', 0, 1)
('cross_entropy', 1, 5)
('binary_cross_entropy_with_logits', 0, 0)
('smooth_l1_loss', 1, 1)
('l1_loss', 1, 1)
('mse_loss', 1, 1)
('smooth_l1_loss', 1, 1)
('l1_loss', 1, 1)
('mse_loss', 1, 1)
('margin_ranking_loss', 0, 0)
('hinge_embedding_loss', 0, 0)
('soft_margin_loss', 0, 0)
('multilabel_soft_margin_loss', 0, 1)
('cosine_embedding_loss', 0, 0)
('pixel_shuffle', 0, 0)
('affine_grid', 3, 14)
('pad', 0, 0)
('pairwise_distance', 0, 0)
('pdist', 0, 0)
('cosine_similarity', 0, 0)
('triplet_margin_loss', 0, 0)
('normalize', 0, 0)
('unfold', 0, 0)
('fold', 0, 0)
('grid_sample', 0, 1)
('gumbel_softmax', 0, 0)
('gumbel_softmax', 0, 0)
('multilabel_margin_loss', 0, 0)
('multi_margin_loss', 0, 0)
('binary_cross_entropy', 1, 5)
('binary_cross_entropy', 1, 5)
('ctc_loss', 0, 0)
('upsample', 13, 71)
('upsample', 13, 71)
('interpolate', 14, 71)
('interpolate', 13, 70)
('interpolate', 14, 71)
('interpolate', 14, 71)
('interpolate', 13, 70)
('interpolate', 14, 71)
('interpolate', 14, 71)
('interpolate', 13, 70)
('interpolate', 14, 71)
('interpolate', 14, 71)
('interpolate', 13, 70)
('interpolate', 14, 71)
('interpolate', 14, 60)
('interpolate', 13, 58)
('interpolate', 14, 60)
('interpolate', 14, 60)
('interpolate', 13, 58)
('interpolate', 14, 60)
('interpolate', 14, 60)
('interpolate', 13, 58)
('interpolate', 14, 60)
('interpolate', 13, 82)
('interpolate', 14, 82)
('interpolate', 14, 82)
('interpolate', 13, 82)
('interpolate', 14, 82)
('interpolate', 14, 82)
('interpolate', 13, 82)
('interpolate', 14, 82)
('interpolate', 14, 71)
('interpolate', 14, 71)
('interpolate', 15, 106)
('interpolate', 14, 73)
('interpolate', 15, 106)
('interpolate', 14, 73)
('interpolate', 15, 92)
('interpolate', 14, 60)
('interpolate', 15, 94)
('interpolate', 14, 62)
('interpolate', 15, 116)
('interpolate', 14, 82)
('interpolate', 15, 118)
('interpolate', 14, 84)
```
nn module tests:
```
('Name', 'Ifs/Loops', 'non-tensor ops')
('test_nn_Linear', 0, 0)
('test_nn_Linear_no_bias', 0, 0)
('test_nn_Threshold_threshold_value', 0, 0)
('test_nn_Threshold_large_value', 0, 0)
('test_nn_ReLU', 0, 0)
('test_nn_ReLU6', 0, 0)
('test_nn_RReLU', 0, 0)
('test_nn_RReLU_with_up_down', 0, 0)
('test_nn_Hardtanh', 0, 0)
('test_nn_Sigmoid', 0, 0)
('test_nn_Tanh', 0, 0)
('test_nn_Flatten', 0, 0)
('test_nn_Softmax', 0, 0)
('test_nn_Softmax2d', 0, 0)
('test_nn_LogSoftmax', 0, 0)
('test_nn_LogSoftmax_multiparam', 0, 0)
('test_nn_ELU', 0, 0)
('test_nn_Hardshrink', 0, 0)
('test_nn_LeakyReLU', 0, 0)
('test_nn_LeakyReLU_with_negval', 0, 0)
('test_nn_LogSigmoid', 0, 0)
('test_nn_Softplus', 0, 0)
('test_nn_Softplus_beta', 0, 0)
('test_nn_Softplus_beta_threshold', 0, 0)
('test_nn_Softshrink', 0, 0)
('test_nn_Softshrink_lambda', 0, 0)
('test_nn_PReLU_1d', 0, 0)
('test_nn_PReLU_1d_multiparam', 0, 0)
('test_nn_PReLU_2d', 0, 0)
('test_nn_PReLU_2d_multiparam', 0, 0)
('test_nn_PReLU_3d', 0, 0)
('test_nn_PReLU_3d_multiparam', 0, 0)
('test_nn_Softsign', 0, 0)
('test_nn_Softmin', 0, 0)
('test_nn_Softmin_multidim', 0, 0)
('test_nn_Tanhshrink', 0, 0)
('test_nn_FractionalMaxPool2d_ratio', 0, 7)
('test_nn_FractionalMaxPool2d_size', 0, 0)
('test_nn_FractionalMaxPool3d_ratio', 0, 10)
('test_nn_FractionalMaxPool3d_size', 0, 0)
('test_nn_FractionalMaxPool3d_asymsize', 0, 0)
('test_nn_BatchNorm1d_affine', 2, 3)
('test_nn_BatchNorm1d_3d_input', 3, 9)
('test_nn_BatchNorm1d_affine_simple_average', 2, 5)
('test_nn_BatchNorm1d_not_affine', 2, 3)
('test_nn_BatchNorm1d_not_tracking_stats', 0, 0)
('test_nn_BatchNorm1d_3d_input_not_affine', 3, 9)
('test_nn_BatchNorm1d_zero_batch', 3, 9)
('test_nn_BatchNorm2d', 3, 13)
('test_nn_BatchNorm2d_2d_simple_average', 3, 15)
('test_nn_BatchNorm2d_momentum', 3, 13)
('test_nn_BatchNorm2d_not_affine', 3, 13)
('test_nn_BatchNorm2d_not_tracking_stats', 1, 10)
('test_nn_BatchNorm2d_zero_batch', 3, 13)
('test_nn_BatchNorm3d', 3, 17)
('test_nn_BatchNorm3d_3d_simple_average', 3, 19)
('test_nn_BatchNorm3d_momentum', 3, 17)
('test_nn_BatchNorm3d_not_affine', 3, 17)
('test_nn_BatchNorm3d_not_tracking_stats', 1, 14)
('test_nn_BatchNorm3d_zero_batch', 3, 17)
('test_nn_InstanceNorm1d', 1, 6)
('test_nn_InstanceNorm1d_tracking_stats', 1, 6)
('test_nn_InstanceNorm2d', 1, 10)
('test_nn_InstanceNorm2d_tracking_stats', 1, 10)
('test_nn_InstanceNorm3d', 1, 14)
('test_nn_InstanceNorm3d_tracking_stats', 1, 14)
('test_nn_LayerNorm_1d_elementwise_affine', 0, 0)
('test_nn_LayerNorm_1d_no_elementwise_affine', 0, 0)
('test_nn_LayerNorm_3d_elementwise_affine', 0, 0)
('test_nn_LayerNorm_3d_no_elementwise_affine', 0, 0)
('test_nn_LayerNorm_1d_empty_elementwise_affine', 0, 0)
('test_nn_GroupNorm_1d_affine', 3, 53)
('test_nn_GroupNorm_1d_no_affine_IN', 3, 53)
('test_nn_GroupNorm_1d_no_affine_LN', 3, 53)
('test_nn_GroupNorm_2d_affine', 3, 53)
('test_nn_GroupNorm_2d_no_affine_IN', 3, 53)
('test_nn_GroupNorm_2d_no_affine_LN', 3, 53)
('test_nn_Conv1d', 0, 0)
('test_nn_Conv1d_stride', 0, 0)
('test_nn_Conv1d_pad1', 0, 0)
('test_nn_Conv1d_pad2', 0, 0)
('test_nn_Conv1d_pad1size1', 0, 0)
('test_nn_Conv1d_pad2size1', 0, 0)
('test_nn_Conv1d_zero_batch', 0, 0)
('test_nn_Conv1d_dilated', 0, 0)
('test_nn_Conv1d_groups', 0, 0)
('test_nn_ConvTranspose1d', 0, 0)
('test_nn_ConvTranspose1d_no_bias', 0, 0)
('test_nn_ConvTranspose1d_dilated', 0, 0)
('test_nn_ConvTranspose1d_groups', 0, 0)
('test_nn_MaxPool1d', 0, 0)
('test_nn_MaxPool1d_stride', 0, 0)
('test_nn_Conv2d', 0, 0)
('test_nn_Conv2d_strided', 0, 0)
('test_nn_Conv2d_padding', 0, 0)
('test_nn_Conv2d_dilated', 0, 0)
('test_nn_Conv2d_no_bias', 0, 0)
('test_nn_Conv2d_zero_batch', 0, 0)
('test_nn_Conv2d_groups', 0, 0)
('test_nn_Conv2d_groups_thnn', 0, 0)
('test_nn_ConvTranspose2d', 0, 0)
('test_nn_ConvTranspose2d_dilated', 0, 0)
('test_nn_ConvTranspose2d_no_bias', 0, 0)
('test_nn_ConvTranspose2d_groups', 0, 0)
('test_nn_Conv2d_depthwise', 0, 0)
('test_nn_Conv2d_depthwise_with_multiplier', 0, 0)
('test_nn_Conv2d_depthwise_strided', 0, 0)
('test_nn_Conv2d_depthwise_padded', 0, 0)
('test_nn_Conv2d_depthwise_dilated', 0, 0)
('test_nn_MaxPool2d', 0, 0)
('test_nn_AvgPool1d', 0, 0)
('test_nn_AvgPool1d_stride', 0, 0)
('test_nn_AvgPool1d_stride_pad', 0, 0)
('test_nn_AvgPool2d', 0, 0)
('test_nn_AvgPool2d_stride', 0, 0)
('test_nn_AvgPool2d_stride_pad', 0, 0)
('test_nn_AvgPool2d_divisor', 0, 0)
('test_nn_AvgPool2d_divisor_stride', 0, 0)
('test_nn_AvgPool2d_divisor_stride_pad', 0, 0)
('test_nn_LPPool2d', 0, 0)
('test_nn_LPPool2d_norm', 0, 0)
('test_nn_LPPool1d_norm', 0, 0)
('test_nn_LPPool1d', 0, 0)
('test_nn_LocalResponseNorm_1d', 0, 0)
('test_nn_LocalResponseNorm_2d_uneven_pad', 0, 0)
('test_nn_LocalResponseNorm_3d_custom_params', 0, 0)
('test_nn_ReflectionPad1d', 0, 0)
('test_nn_ReflectionPad2d', 0, 0)
('test_nn_ReplicationPad1d', 0, 0)
('test_nn_ReplicationPad2d', 0, 0)
('test_nn_ZeroPad2d', 0, 0)
('test_nn_ZeroPad2d_negative_dims', 0, 0)
('test_nn_ConstantPad1d', 0, 0)
('test_nn_ConstantPad2d', 0, 0)
('test_nn_ConstantPad3d', 0, 0)
('test_nn_Conv3d', 0, 0)
('test_nn_Conv3d_no_bias', 0, 0)
('test_nn_Conv3d_stride', 0, 0)
('test_nn_Conv3d_stride_padding', 0, 0)
('test_nn_Conv3d_zero_batch', 0, 0)
('test_nn_Conv3d_groups', 0, 0)
('test_nn_Conv3d_dilated', 0, 0)
('test_nn_Conv3d_dilated_strided', 0, 0)
('test_nn_ConvTranspose3d', 0, 0)
('test_nn_ConvTranspose3d_dilated', 0, 0)
('test_nn_MaxPool3d', 0, 0)
('test_nn_MaxPool3d_stride', 0, 0)
('test_nn_MaxPool3d_stride_padding', 0, 0)
('test_nn_AvgPool3d', 0, 0)
('test_nn_AvgPool3d_stride', 0, 0)
('test_nn_AvgPool3d_stride_pad', 0, 0)
('test_nn_AvgPool3d_stride_pad_gpu_fixedkw_output', 0, 0)
('test_nn_AvgPool3d_stride_pad_gpu_general_output', 0, 0)
('test_nn_AvgPool3d_stride1_pad0_gpu_input', 0, 0)
('test_nn_AvgPool3d_stride_pad_gpu_input_nooverlap', 0, 0)
('test_nn_AvgPool3d_divisor', 0, 0)
('test_nn_AvgPool3d_divisor_stride', 0, 0)
('test_nn_AvgPool3d_divisor_stride_pad', 0, 0)
('test_nn_AvgPool3d_divisor_stride_pad_gpu_fixedkw_output', 0, 0)
('test_nn_AvgPool3d_divisor_stride_pad_gpu_general_output', 0, 0)
('test_nn_AvgPool3d_divisor_stride1_pad0_gpu_input', 0, 0)
('test_nn_AvgPool3d_divisor_stride_pad_gpu_input_nooverlap', 0, 0)
('test_nn_ReplicationPad3d', 0, 0)
('test_nn_Embedding', 0, 0)
('test_nn_EmbeddingBag_mean', 0, 2)
('test_nn_EmbeddingBag_sum', 0, 2)
('test_nn_EmbeddingBag_max', 0, 2)
('test_nn_EmbeddingBag_sparse', 0, 2)
('test_nn_Embedding_sparse', 0, 0)
('test_nn_PixelShuffle', 0, 0)
('test_nn_AdaptiveMaxPool1d', 0, 0)
('test_nn_AdaptiveMaxPool2d_single', 0, 6)
('test_nn_AdaptiveMaxPool2d_tuple', 0, 6)
('test_nn_AdaptiveMaxPool3d_single', 0, 9)
('test_nn_AdaptiveMaxPool3d_tuple', 0, 9)
('test_nn_AdaptiveMaxPool3d_single_nonatomic', 0, 9)
('test_nn_AdaptiveMaxPool3d_tuple_nonatomic', 0, 9)
('test_nn_AdaptiveAvgPool1d', 0, 0)
('test_nn_AdaptiveAvgPool1d_one_output', 0, 0)
('test_nn_AdaptiveAvgPool2d_single', 0, 6)
('test_nn_AdaptiveAvgPool2d_single_1x1output', 0, 6)
('test_nn_AdaptiveAvgPool2d_tuple', 0, 6)
('test_nn_AdaptiveAvgPool3d_single', 0, 9)
('test_nn_AdaptiveAvgPool3d_tuple', 0, 9)
('test_nn_SELU', 0, 0)
('test_nn_SELU_scalar', 0, 0)
('test_nn_CELU', 0, 0)
('test_nn_CELU_scalar', 0, 0)
('test_nn_GLU', 0, 0)
('test_nn_GLU_dim', 0, 0)
('test_nn_GELU_scalar', 0, 0)
('test_nn_GELU', 0, 0)
('test_nn_Unfold', 0, 0)
('test_nn_Fold', 0, 0)
('test_nn_Unfold_int_input', 0, 0)
('test_nn_Fold_int_input', 0, 0)
('test_nn_Threshold_threshold_value_scalar', 0, 0)
('test_nn_ReLU_scalar', 0, 0)
('test_nn_ReLU6_scalar', 0, 0)
('test_nn_RReLU_with_up_down_scalar', 0, 0)
('test_nn_Hardtanh_scalar', 0, 0)
('test_nn_Sigmoid_scalar', 0, 0)
('test_nn_Tanh_scalar', 0, 0)
('test_nn_Softmax_scalar', 0, 0)
('test_nn_LogSoftmax_multiparam_scalar', 0, 0)
('test_nn_ELU_scalar', 0, 0)
('test_nn_Hardshrink_scalar', 0, 0)
('test_nn_LeakyReLU_with_negval_scalar', 0, 0)
('test_nn_LogSigmoid_scalar', 0, 0)
('test_nn_Softplus_beta_threshold_scalar', 0, 0)
('test_nn_Softshrink_lambda_scalar', 0, 0)
('test_nn_PReLU_scalar', 0, 0)
('test_nn_Softsign_scalar', 0, 0)
('test_nn_Softmin_scalar', 0, 0)
('test_nn_Tanhshrink_scalar', 0, 0)
('test_nn_Conv1d_reflect_stride2_pad2', 3, 14)
('test_nn_Conv2d_reflect_stride2_pad2', 3, 14)
('test_nn_Conv1d_circular_stride2_pad2', 5, 31)
('test_nn_Conv2d_circular_stride2_pad2', 5, 31)
('test_nn_Conv3d_circular_stride2_pad2', 5, 31)
('test_nn_Conv1d_replicate_stride2_pad2', 3, 14)
('test_nn_Conv2d_replicate_stride2_pad2', 3, 14)
('test_nn_Conv3d_replicate_stride2_pad2', 3, 14)
('test_nn_Conv1d_zeros_stride2_pad2', 0, 0)
('test_nn_Conv2d_zeros_stride2_pad2', 0, 0)
('test_nn_Conv3d_zeros_stride2_pad2', 0, 0)
('test_nn_Bilinear', 0, 0)
('test_nn_RNNCell', 3, 14)
('test_nn_LSTMCell', 5, 22)
('test_nn_GRUCell', 3, 14)
('test_nn_MultiheadAttention', 40, 160)
('test_nn_Transformer', 128, 499)
```

Test Plan: Imported from OSS

Differential Revision: D20539336

Pulled By: eellison

fbshipit-source-id: 14ac00a7b2b029b9e57f6131dd45426b0101941a
2020-03-23 11:59:11 -07:00
4fae5a6721 Move module graph creation to testing utils (#34917)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34917

Test Plan: Imported from OSS

Differential Revision: D20539338

Pulled By: eellison

fbshipit-source-id: 5c46c0ce50e5bcccf5abee264f432ded7d36d040
2020-03-23 11:59:02 -07:00
77ccb5c14d Move functional graph creation to testing utils (#34916)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34916

Test Plan: Imported from OSS

Differential Revision: D20539337

Pulled By: eellison

fbshipit-source-id: 9b777e369facebbe68fe198ca3eec055cf9c5257
2020-03-23 11:57:25 -07:00
02ab6ced8e test_complex inherits from common_utils.TestCase; closes #34648 (#34697)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34697

Differential Revision: D20596860

Pulled By: anjali411

fbshipit-source-id: 18599fce5bd3513be17ecf83ba9fb0d64d971fc4
2020-03-23 10:49:30 -07:00
21ecb8d870 Fix reference to NO_CUDA and NO_DISTRIBUTED (#34831)
Summary:
- replace the old build variables NO_CUDA and NO_DISTRIBUTED in CONTRIBUTING.md with the new USE_CUDA and USE_DISTRIBUTED versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34831

Differential Revision: D20512659

Pulled By: colesbury

fbshipit-source-id: 2d6cb6fd35886eec0b4b1c94f568b5137407c551
2020-03-23 10:42:01 -07:00
506996c77e [pt][quant] Optimized qadd_scalar (#34925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34925

Optimized path for qadd scalar. qadd_scalar time goes down from 55.840ms for a model to 4.637ms.

### Before
```
  -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.12%            155.807us        0.12%            155.807us        155.807us        1
quantized::conv2d          25.50%           31.981ms         25.50%           31.981ms         273.343us        117
quantized::add_scalar      44.53%           55.840ms         44.53%           55.840ms         809.281us        69
quantized::relu6           1.25%            1.570ms          1.25%            1.570ms          22.749us         69
quantized::mul_scalar      10.73%           13.449ms         10.73%           13.449ms         194.914us        69
quantized::mul             16.67%           20.904ms         16.67%           20.904ms         227.220us        92
adaptive_avg_pool2d        0.03%            41.713us         0.69%            862.922us        35.955us         24
_adaptive_avg_pool2d       0.65%            821.209us        0.65%            821.209us        34.217us         24
sigmoid                    0.15%            182.344us        0.15%            182.344us        7.928us          23
quantized::add             0.34%            431.939us        0.34%            431.939us        26.996us         16
dropout                    0.00%            1.936us          0.00%            1.936us          1.936us          1
view                       0.01%            10.281us         0.01%            10.281us         10.281us         1
dequantize                 0.00%            4.562us          0.00%            4.562us          4.562us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 125.394ms
```
### After
```
 -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.18%            130.534us        0.18%            130.534us        130.534us        1
quantized::conv2d          42.29%           31.267ms         42.29%           31.267ms         267.243us        117
quantized::add_scalar      6.27%            4.637ms          6.27%            4.637ms          67.205us         69
quantized::relu6           1.77%            1.312ms          1.77%            1.312ms          19.008us         69
quantized::mul_scalar      18.92%           13.991ms         18.92%           13.991ms         202.768us        69
quantized::mul             28.49%           21.059ms         28.49%           21.059ms         228.904us        92
adaptive_avg_pool2d        0.06%            45.242us         1.27%            942.522us        39.272us         24
_adaptive_avg_pool2d       1.21%            897.280us        1.21%            897.280us        37.387us         24
sigmoid                    0.22%            160.282us        0.22%            160.282us        6.969us          23
quantized::add             0.56%            416.276us        0.56%            416.276us        26.017us         16
dropout                    0.00%            1.245us          0.00%            1.245us          1.245us          1
view                       0.01%            7.122us          0.01%            7.122us          7.122us          1
dequantize                 0.01%            5.952us          0.01%            5.952us          5.952us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 73.930ms
```
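
For reference, a per-op table like the ones above can be produced with the autograd profiler. A minimal sketch follows; `model` and `inp` are placeholders, not the actual quantized model and input benchmarked in this diff:

```python
import torch

# Minimal profiling sketch; `model` and `inp` are placeholders for the
# quantized model and input actually benchmarked here.
model = torch.nn.Linear(16, 16)
inp = torch.rand(1, 16)

with torch.autograd.profiler.profile() as prof:
    model(inp)

# Prints per-op rows (Self CPU total, CPU time avg, Number of Calls, ...).
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```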
ghstack-source-id: 100595212

Test Plan: buck test //caffe2/test:quantized -- 'test_qadd'  --print-passing-details

Differential Revision: D20500848

fbshipit-source-id: c292d15da121e6d13cc4eb92f10549874ff6ab0f
2020-03-23 10:07:23 -07:00
3e4076aa9c [quant][graphmode] quantization work for prim::If (#34518)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34518

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20519606

fbshipit-source-id: 94d49e18d97df642cbcb446df12376f6d2a397bc
2020-03-23 09:54:24 -07:00
0e0386b434 Revert "[JIT] add id function (#34975)" (#35209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35209

This reverts commit 62f11f0a354690338d749fd698c9cb7281d13aae.

Test Plan: Imported from OSS

Differential Revision: D20596847

Pulled By: albanD

fbshipit-source-id: e6777e42356aac772e59f0466a92cc13258218c1
2020-03-23 08:42:09 -07:00
2c69fa93b9 Fix _copysign is not a member of std (Windows) (#35199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35199

While running `test_cpp_extensions_aot_no_ninja` and building `rng_extension.cpp`, compilation fails with:
[C:\Users\circleci\project\build\win_tmp\build\torch\include\ATen/native/Math.h(82): error C2039: '_copysign': is not a member of 'std'](https://app.circleci.com/pipelines/github/pytorch/pytorch/144367/workflows/f939ad40-273f-4492-a19e-3f602509f6f5/jobs/4907947)

This PR should fix it, based on [MSDN](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/copysign-copysignf-copysignl-copysign-copysignf-copysignl?view=vs-2019).

Test Plan: Imported from OSS

Differential Revision: D20591607

Pulled By: pbelevich

fbshipit-source-id: 4d61245cfeb37c074f0ee89027b60c581b5e08b9
2020-03-23 08:24:22 -07:00
c85697d74d [quant][graphmode][fix] use observed_values_ to check values are observed (#34571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34571

Previously we added wrong values to observed_values_, and it was also
not used to check whether a value is observed or not.

Test Plan:
.

Imported from OSS

Differential Revision: D20519605

fbshipit-source-id: 6038b2539bcf7d679b7fe5c5a284b81a979934ee
2020-03-23 08:07:43 -07:00
350c522423 [quant][graphmode][refactor] insertObservers for Block (#34414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34414

Previously we inserted observers for Graph (a graph is a wrapper around a Block);
this PR adds insertObservers for Block, so that the code can work for nodes that have sub-blocks.

Test Plan:
.

Imported from OSS

Differential Revision: D20519604

fbshipit-source-id: 1908913ea7f0898cd7b4f2edd1f81cdfedf8a211
2020-03-23 08:07:38 -07:00
28bf0038e5 [quant][graphmode][fix] Insert dequantize before use node (#34411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34411

This makes sure that dequantize and the node that uses the dequantized value reside in the
same block, so that we can do quant fusion.

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20519603

fbshipit-source-id: 3e4c68d0a73142716e19ea6a64ae3a5d6d51fa41
2020-03-23 08:07:33 -07:00
4caa0db6e8 [quant][graphmode][fix] preserve the type of original value when inserting dequant node (#34349)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34349

Set the output type of the dequantize node to the type of the original value;
this is to fix swapping dequantize for tensor lists.

Test Plan:
.

Imported from OSS

Differential Revision: D20504456

fbshipit-source-id: 9064d7d598a4310e27e2914a072097526448a02c
2020-03-23 08:06:14 -07:00
358ba59f01 Add THP_API to THPGenerator_Wrap (#35194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35194

Fixes:
```
error LNK2001: unresolved external symbol "struct _object * __cdecl THPGenerator_Wrap(struct at::Generator)" (?THPGenerator_Wrap@YAPEAU_object@UGenerator@at@@Z)
build\lib.win-amd64-3.6\torch_test_cpp_extension\rng.cp36-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals
```

I forgot it, my fault

Test Plan: Imported from OSS

Differential Revision: D20591604

Pulled By: pbelevich

fbshipit-source-id: e8986948fb50aec50db99a72ad112702cbbe831f
2020-03-23 05:58:09 -07:00
36e36eff2f Ignores deliberate undefined float->int conversion (#35086)
Summary:
In C++, casting a floating point value to an integer dtype is undefined when the value is outside the dtype's dynamic range. For example, casting 300.5 to Int8 is undefined behavior because the maximum representable Int8 value is 127, and 300.5 > 127.

PyTorch, like NumPy, deliberately allows and makes these casts, however, and when we do this we trigger undefined behavior that causes our sanitizers to (correctly) complain. I propose skipping this sanitization on our cast function.

The history of this PR demonstrates the issue, showing a single CI failure in the ASAN build when a test is added that converts a large float value to an integral value. The current PR shows a green CI after the sanitization is skipped.

There are alternatives to skipping this sanitization:

- Clamping or otherwise converting floats to the dynamic range of integral types they're cast to
- Throwing a runtime error if a float value is outside the dynamic range of the integral type it's cast to (this would not be NumPy compatible)
- Declaring programs in error if they perform these casts (this is technically true)
- Preventing this happening in PyTorch proper so the ASAN build doesn't fail

None of these alternatives seems particularly appealing, and I think it's appropriate to skip the sanitization because our behavior is deliberate.
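
To make the deliberate behavior concrete, a small sketch (values chosen to match the example above):

```python
import torch

# Casting a float outside int8's range [-128, 127] is undefined behavior in
# C++, but PyTorch (like NumPy) deliberately performs the cast anyway: no
# error is raised, and the resulting value is not meaningful.
x = torch.tensor(300.5)
y = x.to(torch.int8)  # allowed; the result is implementation-defined
print(y)
```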
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35086

Differential Revision: D20591163

Pulled By: mruberry

fbshipit-source-id: fa7a90609c73c4c627bd39726a7dcbaeeffa1d1b
2020-03-23 01:08:57 -07:00
d743c22990 Updating submodules
Summary:
GitHub commits:

58c002d159

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: c286881229d1c6ce5030a4ab24ace757549578d8
2020-03-23 01:07:14 -07:00
a6672f3b30 Updating submodules
Summary:
GitHub commits:

f50c345bf5
d72a3bd5fe
39929d6fa2
49a56f7ee0

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: f4ec04552d173f59ff1279dc1e2148c5c0a3f623
2020-03-22 20:03:23 -07:00
082e48e346 skip ctc_loss test on Windows (#35069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35069

It is flaky on Windows only, so disable for now:
https://github.com/pytorch/pytorch/issues/34870

Test Plan: Imported from OSS

Differential Revision: D20544736

Pulled By: suo

fbshipit-source-id: 49e35a4b4f0d1d20157769a4dff22cb4fe86770c
2020-03-22 18:49:53 -07:00
3f2aa07b13 [ONNX] update producer version (#35059)
Summary:
Updating producer version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35059

Reviewed By: hl475

Differential Revision: D20585173

Pulled By: houseroad

fbshipit-source-id: af0c4e3860beb899548466ea99be2050150f905d
2020-03-22 15:43:31 -07:00
1783ea43e7 [pytorch] deprecate code analyzer -closure option (#35179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35179

Transitive dependencies are calculated in the Python script for both OSS custom build and BUCK selective build, so change the C++ analyzer to take -closure=false by default and remove the param from call sites.
ghstack-source-id: 100637068

Test Plan: CI

Differential Revision: D20586462

fbshipit-source-id: 195849b71cda6228a49ecd2215d3fb8b4da7f708
2020-03-22 14:36:42 -07:00
11a40410e7 pybind11 type_caster for at::Generator and custom RNG python test (#34774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34774

This PR provides pybind11's `type_caster<at::Generator>`, which allows mapping an `at::Generator` instance returned from a user-defined method to the Python `torch.Generator`, defined as the `THPGenerator` C++ class.

This allows (1) defining a custom RNG in a C++ extension and (2) using that custom RNG in Python code.

`TestRNGExtension.test_rng` shows how to use custom RNG defined in `rng_extension.cpp`
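
A hypothetical usage sketch (the function names below are illustrative, not the actual `rng_extension.cpp` API):

```python
import torch
import torch.utils.cpp_extension

# Build a C++ extension whose op returns an at::Generator; the new
# type_caster converts that return value into a Python torch.Generator.
rng_ext = torch.utils.cpp_extension.load(
    name="rng_extension",
    sources=["rng_extension.cpp"],  # path is illustrative
)

gen = rng_ext.create_generator()  # hypothetical name; returns torch.Generator
t = torch.empty(5).random_(generator=gen)  # draw from the custom RNG in Python
```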

Test Plan: Imported from OSS

Differential Revision: D20549451

Pulled By: pbelevich

fbshipit-source-id: 312a6deccf8228f7f60695bbf95834620d52f5eb
2020-03-22 10:57:35 -07:00
b248e23de0 Docs fix: Added missing indent (#35017)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35017

Differential Revision: D20552491

Pulled By: yf225

fbshipit-source-id: 4481e7aecb9dc4a54ef95dfdddacbe3ff48f1c5f
2020-03-22 09:57:11 -07:00
e1c092fe3a Changes to transition to generic API for ops with weight prepacking (#35010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35010

This PR transitions ops with weight-prepacking semantics to a generic API: it moves all the XNNPACK-specific interfaces to a generic interface. Accordingly, it removes XNNPACK-specific references from the API and some variable names.

What has not yet changed (TODO):
USE_XNNPACK is still used. This can be removed where no XNNPACK-specific
things are done, e.g. RegisterOpContext.cpp and
xnnpack_rewrite.cpp.
Also, the file names and structure remain; some of the generic class
definitions can be moved to a non-XNNPACK-specific folder.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20526416

fbshipit-source-id: 2e1725345c44bbb26bdc448097a7384eca121387
2020-03-22 08:31:53 -07:00
1ff5d9c557 Updating submodules
Summary:
GitHub commits:

4b36034a2a

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 3c3d9fcf2fb02579b7a0cc9cebc117c8e6ec394f
2020-03-22 08:30:15 -07:00
a5b509985a Updating submodules
Summary:
GitHub commits:

4a8bc17fc7
d7c4a348e0

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: d90b5234232bfe9ff5eaf6b787777f85ed8065b1
2020-03-21 23:05:18 -07:00
cfc0ff1691 Renaming: MultiLabelMarginLossFuncOptions -> MultilabelMarginLossFuncOptions, MultiLabelSoftMarginLossFuncOptions -> MultilabelSoftMarginLossFuncOptions (#35163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35163

This PR is BC-breaking in the following way:

Renaming:
- `torch::nn::functional::MultiLabelMarginLossFuncOptions` -> `torch::nn::functional::MultilabelMarginLossFuncOptions`
- `torch::nn::functional::MultiLabelSoftMarginLossFuncOptions` -> `torch::nn::functional::MultilabelSoftMarginLossFuncOptions`

Reason for renaming: to be consistent with the corresponding functional name after camel case to snake case conversion (e.g. the `multilabel_margin_loss` functional should use `MultilabelMarginLossFuncOptions` as options)

Test Plan: Imported from OSS

Differential Revision: D20582598

Pulled By: yf225

fbshipit-source-id: 0f5bdb8249d901b310875a14320449a2fdfa8ecd
2020-03-21 18:34:46 -07:00
5306713a36 Replace Generator* with Generator that holds std::shared_ptr<GeneratorImpl> (#34468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34468

This PR prepares `at::Generator` for pybind11's `type_caster<at::Generator>` which is required to implement custom RNG in python. The following changes are done:
1. `at::Generator` was moved to `c10::GeneratorImpl` (similar to `c10::TensorImpl`)
2. `at::Generator` was recreated as a holder of `std::shared_ptr<c10::GeneratorImpl>` (similar to `at::Tensor` that holds `c10::intrusive_ptr<c10::TensorImpl>`)
3. Most of `at::Generator*` usages were replaced with `at::Generator`

TBD: replacing `Generator generator = nullptr` with `{}` requires JIT changes (adding Generator to IValue?)

Differential Revision: D20549420

Pulled By: pbelevich

fbshipit-source-id: 4c92a40eab8f033b359bb6c93f4cd84b07ee8d4e
2020-03-21 17:36:10 -07:00
a100cf5146 Revert D20541090: [JIT][torchbind] Namespaces for torchbind classes
Test Plan: revert-hammer

Differential Revision:
D20541090

Original commit changeset: ce3d9391dd3c

fbshipit-source-id: acc1d660fbda611941381315507dfe594c385db1
2020-03-21 12:20:44 -07:00
bbec4520c6 Add inplace tests for several torch::nn modules / functionals (#35147)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35147

Test Plan: Imported from OSS

Differential Revision: D20578217

Pulled By: yf225

fbshipit-source-id: b8bafa49ee94c7dfbbca6e100ee3d9df5b2b621c
2020-03-21 10:02:56 -07:00
f515d87296 Updating submodules
Summary:
GitHub commits:

b816974e2e
fff5d32fbc

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 5e210ce767d542c13ab1c43cbd38bf87ec7f95df
2020-03-21 04:11:47 -07:00
95ad94c75b [TensorExpr] Nuke tensorexpr::schedule namespace. (#35126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35126

Test Plan: Imported from OSS

Differential Revision: D20569364

Pulled By: ZolotukhinM

fbshipit-source-id: c0d51ecadf411918641cdbdc6d8cb06e207d2c9b
2020-03-20 23:39:14 -07:00
d609f356de [TensorExpr] Use const Expr* instead of ExprHandle& in Range. (#35125)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35125

Test Plan: Imported from OSS

Differential Revision: D20568555

Pulled By: ZolotukhinM

fbshipit-source-id: 5f5467641eff9e864831486a2f1ff097281ad0b0
2020-03-20 23:39:09 -07:00
65cea95777 [TensorExpr] Rename schedule.{cpp,h} to loopnest.{cpp,h}. (#35119)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35119

Differential Revision: D20567927

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 1fb6d03bd4c6e66aca62140d2b537692577f261d
2020-03-20 23:37:51 -07:00
3342ab89ac [pytorch] revert register c10 ops for static dispatch (#35148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35148

PR #34275 (commit 064c47845380715e290eb335919a18fe3821ee83) causes size
regression for BUCK build before BUCK selective build is enabled.

This PR partially reverts it (adding back #ifndef USE_STATIC_DISPATCH) to
fix the size regression. We will wait for the BUCK selective build change to
land and soak, then revert this revert.

Test Plan: Imported from OSS

Differential Revision: D20578316

Pulled By: ljk53

fbshipit-source-id: 694f01ec7a69fe3758a389e22e9de20ecd867962
2020-03-20 23:13:01 -07:00
3fa7813b9f [quant] Add dequantize.tensors (#34348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34348

We need this function to swap dequantize for prim::ListConstruct, since
the output of prim::ListConstruct is a list of Tensors.

Test Plan:
.

Imported from OSS

Differential Revision: D20504454

fbshipit-source-id: e6155e37da98e2219a6f79737cd46fe32a509c9f
2020-03-20 22:51:51 -07:00
d87750cd04 [caffe2.proto] Add backend_option to PartitionInfo
Summary: Att

Test Plan: Updated C2 importer test in stack.

Reviewed By: yinghai, bangshengtang

Differential Revision: D20527162

fbshipit-source-id: cf3d59089b651565db74f2a52af01f26fdfcbca6
2020-03-20 22:43:50 -07:00
a2557970f3 Fix F::interpolate and torch::nn::Upsample implementation (#35025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35025

This PR fixes `F::interpolate` and `torch::nn::Upsample` implementation to match the Python API implementation.

**This PR is BC-breaking in the following way:**

There are changes to `UpsampleOptions` and `InterpolateFuncOptions`:
- `size` is changed from `std::vector<int64_t>` to `c10::optional<std::vector<int64_t>>`. If you want to pass a list of `int64_t` to this argument, you must pass it as `std::vector<int64_t>`.
- `scale_factor` is changed from `std::vector<double>` to `c10::optional<std::vector<double>>`. If you want to pass a list of `double` to this argument, you must pass it as `std::vector<double>`.

**TODO**: cherry-pick this PR into v1.5 release branch.

Test Plan: Imported from OSS

Differential Revision: D20559892

Pulled By: yf225

fbshipit-source-id: ac18609e351a9f2931eaeced8966b9491b2995f7
2020-03-20 22:37:13 -07:00
c0958c883e Fix fractional_max_pool3d_with_indices implementation (#35024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35024

**TODO**: cherry-pick this PR into v1.5 release branch.

Test Plan: Imported from OSS

Differential Revision: D20559891

Pulled By: yf225

fbshipit-source-id: c2b5c005c0bd560b5a84d4cc9097dbd64ee902c0
2020-03-20 22:37:08 -07:00
ef7fe371ce Fix Conv and ConvTranspose implementation (#35023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35023

This PR fixes Conv and ConvTranspose implementation to match the Python API implementation.

**TODO**: cherry-pick this PR into v1.5 release branch.

Test Plan: Imported from OSS

Differential Revision: D20559889

Pulled By: yf225

fbshipit-source-id: 53783a7398ef968ec6d25b6f568fde44907417c5
2020-03-20 22:37:03 -07:00
d7462dcea6 Fix AdaptiveAvgPool{2,3}d and AdaptiveMaxPool{2,3}d implementation (#35022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35022

This PR fixes `AdaptiveAvgPool{2,3}d` and `AdaptiveMaxPool{2,3}d` implementation to match the Python API implementation. Particularly, `output_size` is changed to accept `c10::nullopt` in its elements, matching the Python API behavior.

**TODO**: cherry-pick this PR into v1.5 release branch.

Test Plan: Imported from OSS

Differential Revision: D20559890

Pulled By: yf225

fbshipit-source-id: ccddbd278dd39165cf1dda11fc0e49387c76dbef
2020-03-20 22:36:57 -07:00
845b19c4ef Add weight_scale in Adagrad (#34944)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34944

Reviewed By: chonglinsun

Differential Revision: D20506032

fbshipit-source-id: ef025e536da01fdcabc783466bc065685b80ab9a
2020-03-20 22:36:51 -07:00
c21fde6421 [jit] make jit/rpc share the same PythonFutureWrapper (#35039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35039

This is the initial step towards merging ivalue future and rpc future

Test Plan: Imported from OSS

Differential Revision: D20537164

Pulled By: wanchaol

fbshipit-source-id: d4f148c88e49ed6b0881ca4b4dd945ea24166183
2020-03-20 22:35:34 -07:00
43fc97db88 Updating submodules
Summary:
GitHub commits:

db31580d1b

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: e5232f60b99f57bbdef30653feac2a87ef9560d2
2020-03-20 20:08:21 -07:00
d45e135d89 Updating submodules
Summary:
GitHub commits:

cd8de9ff9f
b335c45643
cc1d36f0cd
05177629a2
20e91cc072
3ee0b7d56f

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: eea82084992bb504e378163594ad6b06822e51a7
2020-03-20 20:08:14 -07:00
4594433319 Add retry to pip usage in mobile job (#35122)
Summary:
To reduce flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35122

Differential Revision: D20573303

Pulled By: kostmo

fbshipit-source-id: ef99abe3910498c1da309c4616067055738425f5
2020-03-20 20:08:07 -07:00
bf31b1b6be Upgrade protobuf as bazel build preamble (#34662)
Summary:
The protobuf bazel definitions are incompatible with recent bazel
versions, so as a prerequisite for any bazel build of pytorch, a more
recent protobuf must be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34662

Differential Revision: D20570425

Pulled By: malfet

fbshipit-source-id: ed4de3eb3fe05f076df93db7175954e797791300
2020-03-20 20:07:59 -07:00
e98b8eb35f [profiler] remove unused _push_range and _pop_range (#35028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35028

Removes these methods, which are not used anywhere in the codebase. With this we can also remove the public declarations of TORCH_API popRange and TORCH_API pushRange, since those were the only use cases.
ghstack-source-id: 100560207

Test Plan: CI

Differential Revision: D20531148

fbshipit-source-id: 8ceaf64449c77259a582a38b1137827ff1ab07f7
2020-03-20 20:07:53 -07:00
4025729e88 [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636

Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072

Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.

In order to effectively test whether the RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar, but with minor changes to ensure short-running tests wait for retried RPCs to finish).

This faulty RPC Agent is pretty configurable. The tests can configure which messages types to fail, and how many messages to fail, but going forward, other RPC functionality can be overriden with faulty methods to test with failures injected.

Differential Revision: D20019236

fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
2020-03-20 20:07:47 -07:00
61b680c012 [pytorch] force c10 schema registration for custom build
Summary:
PR #32521 has several issues with mobile builds:
1. It didn't work with static dispatch (which OSS mobile build currently uses);
2. PR #34275 fixed 1) but it doesn't fix custom build for #32521;
3. manuallyBoxedKernel has a bug with ops which only have catchAllKernel: 2d7ede5f71

Both 1) and 2) have similar root cause - some JIT side code expects certain schemas to be registered in JIT registry.
For example: considering this code snippet: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/frontend/builtin_functions.cpp#L10
```
auto scalar_operators_source = CodeTemplate(
    R"SCRIPT(
def mul(a : ${Scalar}, b : Tensor) -> Tensor:
  return b * a
...
```

It expects "aten::mul.Scalar(Tensor self, Scalar other) -> Tensor" to be registered in JIT - it doesn't necessarily need to call the implementation, though; otherwise it will fail some type check: https://github.com/pytorch/pytorch/pull/34013#issuecomment-592982889

Before #32521, all JIT registrations happen in register_aten_ops_*.cpp generated by gen_jit_dispatch.py.
After #32521, for ops with full c10 templated boxing/unboxing support, JIT registrations happen in TypeDefault.cpp/CPUType.cpp/... generated by aten/gen.py, with c10 register API via RegistrationListener in register_c10_ops.cpp. However, c10 registration in TypeDefault.cpp/CPUType.cpp/... are gated by `#ifndef USE_STATIC_DISPATCH`, thus these schemas won't be registered in JIT registry when USE_STATIC_DISPATCH is enabled.

PR #34275 fixes the problem by moving c10 registration out of `#ifndef USE_STATIC_DISPATCH` in TypeDefault.cpp/CPUType.cpp/..., so that all schemas can still be registered in JIT. But it doesn't fix custom build, where we only keep c10 registrations for ops used by specific model directly (for static dispatch custom build) and indirectly (for dynamic dispatch custom build). Currently there is no way for custom build script to know things like "aten::mul.Scalar(Tensor self, Scalar other) -> Tensor" needs to be kept, and in fact the implementation is not needed, only schema needs to be registered in JIT.

Before #32521, the problem was solved by keeping a DUMMY placeholder for unused ops in register_aten_ops_*.cpp: https://github.com/pytorch/pytorch/blob/master/tools/jit/gen_jit_dispatch.py#L326
After #32521, we could do similar thing by forcing aten/gen.py to register ALL schema strings for selective build - which is what is PR is doing.

Measured impact on custom build size (for MobileNetV2):
```
SELECTED_OP_LIST=MobileNetV2.yaml scripts/build_pytorch_android.sh armeabi-v7a
```
Before: 3,404,978
After:  3,432,569

~28K compressed size increase due to including more schema strings.

The table below summarizes the relationship between codegen flags and 5 build configurations that are related to mobile:
```
+--------------------------------------+-----------------------------------------------------------------------------+--------------------------------------------+
|                                      |                              Open Source                                    |                  FB BUCK                   |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
|                                      |    Default Build    | Custom Build w/ Stat-Disp | Custom Build w/ Dyna-Disp |   Full-JIT    |         Lite-JIT           |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| Dispatch Type                        | Static              | Static                    | Dynamic                   | Dynamic (WIP) | Dynamic (WIP)              |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| ATen/gen.py                          |                     |                           |                           |               |                            |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| --op_registration_whitelist          | unset               | used root ops             | closure(used root ops)    | unset         | closure(possibly used ops) |
| --backend_whitelist                  | CPU Q-CPU           | CPU Q-CPU                 | CPU Q-CPU                 | CPU Q-CPU     | CPU Q-CPU                  |
| --per_op_registration                | false               | false                     | false                     | false         | true                       |
| --force_schema_registration          | false               | true                      | true                      | false         | false                      |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| tools/setup_helpers/generate_code.py |                     |                           |                           |               |                            |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
| --disable-autograd                   | true                | true                      | true                      | false         | WIP                        |
| --selected-op-list-path              | file(used root ops) | file(used root ops)       | file(used root ops)       | unset         | WIP                        |
| --disable_gen_tracing                | false               | false                     | false                     | false         | WIP                        |
+--------------------------------------+---------------------+---------------------------+---------------------------+---------------+----------------------------+
```

Differential Revision: D20397421

Test Plan: Imported from OSS

Pulled By: ljk53

fbshipit-source-id: 906750949ecacf68ac1e810fd22ee99f2e968d0b
2020-03-20 20:07:34 -07:00
064c478453 [pytorch] register c10 ops for static dispatch to unblock c10 boxing
Summary:
PR #32521 broke static dispatch because some ops are no longer
registered in register_aten_ops_*.cpp - it expects the c10 registers in
TypeDefault.cpp / CPUType.cpp / etc to register these ops. However, all
c10 registers are inside `#ifndef USE_STATIC_DISPATCH` section.

To measure the OSS mobile build size impact of this PR:
```
 # default build: scripts/build_pytorch_android.sh armeabi-v7a
 # mobilenetv2 custom build: SELECTED_OP_LIST=MobileNetV2.yaml scripts/build_pytorch_android.sh armeabi-v7a
```

- Before this PR, Android AAR size for arm-v7:
* default build: 5.5M;
* mobilenetv2 custom build: 3.2M;

- After this PR:
* default build: 6.4M;
* mobilenetv2 custom build: 3.3M;

It regressed default build size by ~1M because more root ops are
registered by c10 registers, e.g. backward ops which are filtered out by
gen_jit_dispatch.py for inference-only mobile build.

mobilenetv2 custom build size regressed by ~100k presumably because
the op whitelist is not yet applied to things like BackendSelectRegister.

Differential Revision: D20266240

Test Plan: Imported from OSS

Pulled By: ljk53

fbshipit-source-id: 97a9a06779f8c62fe3ff5cce089aa7fa9dee3c4a
2020-03-20 20:07:15 -07:00
3a772b798a Auto-format jit/rpc_test.py with flake8-black (#35075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35075

as titled

Test Plan: n/a

Differential Revision: D7684240

fbshipit-source-id: e883bd2357164e204cd433d4a1ad4da643a03fe4
2020-03-20 20:07:09 -07:00
e0496a70fc [JIT][torchbind] Namespaces for torchbind classes (#35054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35054

Test Plan: Imported from OSS

Differential Revision: D20541090

Pulled By: jamesr66a

fbshipit-source-id: ce3d9391dd3cdf619042b8f6ba2645f4c1fc875c
2020-03-20 20:07:02 -07:00
65ff064763 Parallelize cpu index_put accumulate float path with cpu_atomic_add_float (#29705)
Summary:
This tries to parallelize the index_put accumulate path for float type on CPU. cpu_atomic_add_float is implemented using the atomic_compare_exchange_strong function.
For the [DLRM](https://github.com/facebookresearch/dlrm) benchmark, _index_put_impl_ function time can be reduced from 827.741ms to 116.646ms over 1000 batches.

Adds a parameter "grain_size" to TensorIterator::for_each to fine-tune index_put performance.
The default value of grain_size is internal::GRAIN_SIZE. The index_put grain size is tuned to 3000 and the cpu_kernel_vec grain size is tuned to 1024. The following is the grain-size impact on the DLRM ops
(_index_put_impl_ based on index_put has been parallelized with cpu_atomic_add_float):

|  Op Name               | without small grain_size      | with 1024 as grain_size in cpu_kernel_vec and 3000 in cpu_index_kernel    |
|-----------------|----------:|----------:|
|  add_  | 11.985s | 11.601s |
| mm            | 9.706s   |   9.518s |
| addmm         | 5.380s   | 5.247s |
| _embedding_bag         | 2.992s   | 2.663s |
| _embedding_bag_backward         | 1.330s   | 1.354s |
| threshold_backward         | 686.920ms   | 659.169ms |
| _index_put_impl_         | 489.411ms   | 116.646ms |
| bmm         | 413.129ms   | 362.967ms |
| zero_         | 379.659ms   | 310.623ms |
| add         | 205.904ms   | 171.111ms |
| cat         | 187.101ms   | 175.621ms |
| Self CPU time total (s)         | 36.544   | 34.742 |
| Average ms per iteration         | 38.25   | 36.44 |

For more of the reasoning behind the grain-size tuning, please see [PR#30803](https://github.com/pytorch/pytorch/issues/30803).
To reproduce the DLRM performance reported here, please also have a look at
[PR#23057](https://github.com/pytorch/pytorch/pull/23057), [PR#24385](https://github.com/pytorch/pytorch/pull/24385) and [PR#27804](https://github.com/pytorch/pytorch/pull/27804),
and set the env vars as below:
```
export LD_PRELOAD=$HOME/anaconda3/lib/libjemalloc.so  (conda install jemalloc)
export KMP_BLOCKTIME=1
export KMP_AFFINITY="granularity=fine,compact,1,0"
```
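
A minimal sketch (with assumed tensor sizes, not the DLRM configuration) of exercising the accumulate path that this PR parallelizes:

```python
import time
import torch

# index_put_ with accumulate=True sums values at duplicate indices; this is
# the float CPU path parallelized with cpu_atomic_add_float.
dst = torch.zeros(1000, dtype=torch.float)
idx = torch.randint(0, 1000, (1_000_000,))
src = torch.rand(1_000_000)

start = time.time()
dst.index_put_((idx,), src, accumulate=True)
print(f"index_put_ (accumulate) took {time.time() - start:.4f}s")
```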
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29705

Differential Revision: D19777742

Pulled By: VitalyFedyunin

fbshipit-source-id: a8222fe6089b6bf56b674e35f790508ad05385c0
2020-03-20 20:06:54 -07:00
3e58cba3c5 Fixes the Conv2d batch_norm folding for various cases. (#34932)
Summary:
This PR adds a preprocessing step to Conv2d-BatchNorm folding.
It traverses the module to check whether the bias of a Conv2d module is set to
None. If it is, it assumes that this is a traced module and inserts an
Optional[Tensor]-typed bias.
Furthermore, it inserts a getAttr for the bias in the forward graph and fixes the
_convolution op to take values from the getAttr.
It also fixes parameter extraction from the BN module, which may not
have weight and bias attributes if affine was set to False. In scripted
mode such a BN module will get weight and bias attributes set to None.
eps gets const-propagated in tracing; this is also fixed.
A few test cases are added.
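
A minimal sketch of the scenario being fixed, assuming a traced module (Conv2d with bias=False, BatchNorm2d with affine=False); shapes are illustrative:

```python
import torch
import torch.nn as nn

# Conv2d(bias=False) leaves conv.bias set to None, and BatchNorm2d with
# affine=False has no learnable weight/bias -- exactly the cases the
# folding pass now handles.
class ConvBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, bias=False)
        self.bn = nn.BatchNorm2d(8, affine=False)

    def forward(self, x):
        return self.bn(self.conv(x))

m = torch.jit.trace(ConvBN().eval(), torch.rand(1, 3, 16, 16))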
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34932

Test Plan:
python test/test_jit.py TestJit.test_foldbn_trivial
python test/test_jit.py TestJit.test_foldbn_trivial_nobias
python test/test_jit.py TestJit.test_foldbn_in_submodule
python test/test_jit.py TestJit.test_foldbn_shared_classtype
python test/test_jit.py TestJit.test_foldbn_complex_cases
python test/test_jit.py TestJit.test_nofoldbn_complex_cases

Differential Revision: D20536478

Pulled By: kimishpatel

fbshipit-source-id: 4e842976a380d0575a71001bb4481390c08c259e
2020-03-20 20:06:44 -07:00
df8d6eeb19 Update docs about DP and DDP for CUDA (#35063)
Summary:
We should recommend DDP instead of DP. Hope we can also cherry-pick this for 1.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35063

Differential Revision: D20549621

Pulled By: ngimel

fbshipit-source-id: 86b1b2134664065cc6070ea4212895f993eaf543
2020-03-20 20:06:37 -07:00
f9cddff25a [android] Preload module actions do only once (#32313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32313

`torch::autograd::profiler::pushCallback()` and `torch::jit::setPrintHandler` should be called only once, not before every module load.

`JITCallGuard guard;` is not needed before loading a module and has no effect.

Test Plan: Imported from OSS

Differential Revision: D20559676

Pulled By: IvanKobzarev

fbshipit-source-id: 70cce5d2dda20a00b378639725294cb3c440bad2
2020-03-20 20:06:25 -07:00
4bd5d1b3be [TVM] Use caffe2_predictor_model_shape_hints to pass shape_hints to TVM (#35091)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35091

Test Plan:
AI/AF canary to make sure it does not affect production:

https://our.intern.facebook.com/intern/ads/canary/425387509869003921/
https://our.intern.facebook.com/intern/ads/canary/425387881631488449/

Glow:

```
buck test glow:
```

Reviewed By: yinghai

Differential Revision: D20552830

fbshipit-source-id: bdf65fb0ba945963a7c9621cc3f7ea5ebaecb907
2020-03-20 20:06:17 -07:00
ca1e2cda05 Port set_ to ATen. (#34403)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34403

Test Plan: Imported from OSS

Differential Revision: D20342300

Pulled By: gchanan

fbshipit-source-id: 8dd223b19539137fab36dd0e751b19e0b4507959
2020-03-20 20:06:11 -07:00
62f11f0a35 [JIT] add id function (#34975)
Summary:
Adds an `id` function to give users a way of keeping a `seen` set of nn modules.
In practice, this is only used between values of `T` and `T`, or `T` and `Optional[T]`, so in this implementation I made it so that None is the only value that can be zero. Python also only guarantees that `id()` gives semantically meaningful results for pointer types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34975

Differential Revision: D20549677

Pulled By: eellison

fbshipit-source-id: cca5ed4ef013f7540f93abf49f91f9830dfdca14
2020-03-20 20:03:10 -07:00
12f0052eee Add TensorExpr Fuser tests (resubmit). (#35085)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35085

Test Plan: Imported from OSS

Differential Revision: D20552334

Pulled By: ZolotukhinM

fbshipit-source-id: 628fcf4719a879f18978ff8a0a64afbb045df645
2020-03-20 13:19:31 -07:00
3c409fc66c Add guard elimination cases for operators encountered on an RL workload. (#34967)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34967

Differential Revision: D20547478

Pulled By: resistor

fbshipit-source-id: da7df159fd6098d0f1278b8088bbbe6717b79cfc
2020-03-20 13:04:44 -07:00
faa853fefb Revert D20254663: [pytorch][PR] Vectorize in-place comparison operators
Test Plan: revert-hammer

Differential Revision:
D20254663

Original commit changeset: 68b7109ec435

fbshipit-source-id: 73474d88a7bb96448428ea5ff780e77163a00f88
2020-03-20 13:02:21 -07:00
ea41bf3100 [android] Maven publishing license fix (#32474)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32474

Test Plan: Imported from OSS

Differential Revision: D20559815

Pulled By: IvanKobzarev

fbshipit-source-id: 69a4fe951d331eb311bf821f94b372ccecdf1fd6
2020-03-20 12:27:08 -07:00
8998a1b3d3 Add tensorexpr benchmarks. (#35064)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35064

Test Plan: Imported from OSS

Differential Revision: D20543695

Pulled By: ZolotukhinM

fbshipit-source-id: 1cf294ab19465cb93557c2b195252c739b40a0f7
2020-03-20 12:01:31 -07:00
bf41a7624e fix missing comma in activation benchmarks (#35104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35104

I missed this in https://github.com/pytorch/pytorch/pull/34959
after a rebase, fixing.

Test Plan:
running benchmarks no longer crashes
CI

Imported from OSS

Differential Revision: D20560908

fbshipit-source-id: a5494e23953d3c9007e9874d673896291b5322e0
2020-03-20 11:36:05 -07:00
7d5a899883 randn cuda kernel complex dtype (#35056)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35056

Differential Revision: D20559396

Pulled By: anjali411

fbshipit-source-id: 64b911f893e9c54aef89e8c1e643998d8b70e613
2020-03-20 11:19:08 -07:00
451e4d578d Define +, -, *, / between complex numbers and integers (#34506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34506

Test Plan: Imported from OSS

Differential Revision: D20559619

Pulled By: anjali411

fbshipit-source-id: c63cb3c07f694c10328fc17f99d69d7134e5c67a
2020-03-20 11:17:03 -07:00
91d39de149 Vectorize in-place comparison operators (#33252)
Summary:
Benchmark: (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz)

```python
import timeit
for op in ('gt', 'lt', 'ge', 'le', 'eq', 'ne'):
    for dtype in ('torch.float', 'torch.double', 'torch.int16', 'torch.int32', 'torch.int64'):
        for n, t in [(10_000, 100000),
                    (100_000, 10000)]:
            print(f'a.{op}_(b), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(f'a.{op}_(b)', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```

Before:

```
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.778998922000028
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6359690249992127
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.double
1.0801493119997758
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.9360321379990637
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7341018620008981
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6345281440007966
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7396387640001194
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6429641230006382
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7759611700003006
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6672059659995284
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7724312530008319
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6392585769990546
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.7917451840003196
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.6455550159989798
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.739991647998977
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6572993859990675
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7627949479992822
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6476544910001394
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7965036850000615
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6780715599998075
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7653547080008138
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6383065829995758
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.7895260240002244
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.6508346030004759
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7409299750015634
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6383492870008922
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7620547579990671
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6474270239996258
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.8070051169997896
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6712598600006459
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7627660060006747
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6406353189995571
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.double
1.0826010620003217
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.9391552950000914
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7427801039993938
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6365172640016681
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7679271510005492
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6453389289999905
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.788032889000533
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6708840760002204
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.float
1.078837263999958
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.9397531720005645
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.double
1.1031508050000411
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.9412319389994082
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.7509566959997755
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.638570957000411
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7592877549996047
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6458840529994632
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7984061539991671
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6776346309998189
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.7724407899986545
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.6581534130000364
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.8303323249983805
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.6954390920000151
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.745512373998281
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.6360954970004968
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.7569978400006221
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.6450422030011396
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.7889118379989668
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.6693385389989999
```

After:

```
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2444220920006046
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.2031730359994981
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.35491806199934217
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.3905606850003096
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.16665379499863775
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10095906300011848
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.21650469999985944
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.18737469400002738
a.gt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35481256200000644
a.gt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.36696120199849247
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.21976138800164335
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.20275393200063263
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.3695997209997586
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.39441510399956314
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.15657078300137073
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.0992998069996247
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.20425128799979575
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.20352934599941364
a.lt_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35883567900054913
a.lt_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.39059587599876977
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.21457727400047588
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.18836135499986995
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.35971907199927955
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.3688875009993353
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.1576009280015569
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.09524034199966991
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.2064543649994448
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.18726435600001423
a.ge_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35351785300008487
a.ge_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.3680737989998306
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2132134399998904
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.2140274829998816
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.36539215199991304
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.39128020300086064
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.15712150600120367
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10149904400168452
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.2103407699996751
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.2134442910009966
a.le_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.35387034300038067
a.le_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.38917528399906587
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2190484450002259
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.2030815980015177
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.3710030169986567
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.36419657899932645
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.15986497499943653
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10145393699895067
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.21011781599918322
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.20121852699958254
a.eq_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.36681504499938455
a.eq_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.364472848999867
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.float
0.2290963309988001
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.float
0.21674784300012107
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.double
0.3829616689999966
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.double
0.39437660300063726
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int16
0.1661020749997988
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int16
0.10052955100036343
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int32
0.21827425599985872
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int32
0.21522501399886096
a.ne_(b), numel() == 10000 for 100000 times, dtype=torch.int64
0.37058242300008715
a.ne_(b), numel() == 100000 for 10000 times, dtype=torch.int64
0.39304063900090114
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33252

Differential Revision: D20254663

Pulled By: ezyang

fbshipit-source-id: 68b7109ec4359434afbeb96df372e29608f501bb
2020-03-20 10:31:34 -07:00
8bcedf7da2 Adds truncated normal initializer (#32397)
Summary:
This adds the `trunc_normal_` function to `torch.nn.init`, which fills tensors in-place with values drawn from a truncated normal distribution. I chose to use the inverse CDF method to implement this. I have included the appropriate code in `test_nn.py` for verifying that the values are from the correct distribution.

Reasons I chose this method:
1. Easily implemented to operate on memory in place, as the other initializers are.
1. No resampling delays
1. This method's main weakness is unlikely to be an issue. While the inverse CDF method can fail to generate the correct distribution when `b < mean` or `mean < a`,  I expect users will choose `a` and `b` so that `a < mean < b`. This method is extremely effective in this case.
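
A short usage sketch (assuming the signature added here, with truncation bounds `a` and `b`):

```python
import torch
import torch.nn as nn

# Fill a weight tensor in-place with samples from N(0, 1) truncated to
# [-2, 2]; values outside [a, b] never appear in the output.
w = torch.empty(3, 5)
nn.init.trunc_normal_(w, mean=0.0, std=1.0, a=-2.0, b=2.0)
print(w.min().item() >= -2.0 and w.max().item() <= 2.0)  # True
```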
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32397

Differential Revision: D20550996

Pulled By: ezyang

fbshipit-source-id: 298a325043a3fd7d1e24d266e3b9b6cc14f81829
2020-03-20 10:29:05 -07:00
a5b5ea9852 use new cuda kernel launch code in nvprof parsing (#35016)
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/33986.

The meaning of cbid 13 and 211 can be found here:

837c094852/nvprof2json.py (L238)

837c094852/nvprof2json.py (L436)

or it can also be found in the header file at `/usr/local/cuda/extras/CUPTI/include/cupti_runtime_cbid.h`.

Please also check [this at stackoverflow](https://stackoverflow.com/questions/48552390/whats-the-difference-between-launching-with-an-api-call-vs-the-triple-chevron-s). I also executed the profiling code (in the issue) on CUDA 9.2, and the cbid is already changed to 211. Just in case someone would build pytorch against older CUDA versions, I leave both 13 and 211 in the assertion.

cc csarofeen ptrblck ezyang ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35016

Differential Revision: D20550879

Pulled By: ezyang

fbshipit-source-id: 968efc5e1126f1dd31acc9f5f4463f351d8a4c4f
2020-03-20 08:23:52 -07:00
e3272559e4 [caffe2] SWA operator (#34394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34394

# SWA operator
In this diff, we added a new operator `SWA` which will be used in `AdaGradOptimizer`.

The algorithm looks like:

(algorithm figure: internal image F230902995, not reproduced here)
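
As a hedged sketch (assuming this follows the standard stochastic-weight-averaging update), the operator maintains a running average of the optimizer iterates:

$$\bar{w}_t = \frac{t \cdot \bar{w}_{t-1} + w_t}{t + 1}$$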

# Background

In our testing, we found that this operator could improve our models' reproducibility a lot (KT: 0.86 -> 0.92).

So we hope to land this operator and, in the future, enable it by default in our models.

Test Plan:
Local build `aml.dper3:30f068668cfb408fbb40141fb17129f2` and bento kernel.
- Local test: n215857
- f174600345

Reviewed By: chocjy

Differential Revision: D20165239

fbshipit-source-id: c03cdd048cb10b091e5f06323f4c0f3999f95d8a
2020-03-20 08:17:08 -07:00
781f590f33 [C++ API Parity] Add xor_convergence test for lbfgs (#35001)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35001

Differential Revision: D20548983

Pulled By: anjali411

fbshipit-source-id: 1f858635d0680c0109d1ef348b7df4d3844fe0a6
2020-03-20 06:57:24 -07:00
1c958f8ef9 Engine::~Engine() should wait for non-reentrant threads to shutdown (#34529)
Summary:
Because `this` must be valid while `Engine::main_thread` is running, at least for non-reentrant worker threads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34529

Test Plan: Run `test_api --gtest-filter=ModulesTest.InstanceNorm1d` in a loop

Differential Revision: D20552717

Pulled By: malfet

fbshipit-source-id: a0197671db1b7b1499dda675e43e0826f368bf0d
2020-03-20 00:49:48 -07:00
ec9f680973 Enforce rref python pickling to be in the scope of RPC call (#34755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34755

This diff disallows using the Python pickler to pickle an RRef. An RRef can only be pickled in the scope of an RPC call using _InternalRPCPickler.
ghstack-source-id: 100481337

Test Plan: unit tests

Differential Revision: D20453806

fbshipit-source-id: ebd4115ee01457ba6958cde805afd0a87c686612
2020-03-19 23:43:45 -07:00
6000dca5df [nomnigraph] Copy device option when customize the op conversion (#34976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34976

Previously, we were dropping the original device option info when we override the operator conversion function.

Test Plan:
```
buck test caffe2/caffe2/opt:converter_nomigraph_test
```

Reviewed By: ipiszy

Differential Revision: D20507277

fbshipit-source-id: 66b5eab07d18651eff27dab2a809cd04872ac224
2020-03-19 22:48:28 -07:00
fe276d541e Revert D20541921: [pytorch][PR] [RELAND] Eager autocasting, out-of-place ops only (with MSVC 2017 fix)
Test Plan: revert-hammer

Differential Revision:
D20541921

Original commit changeset: abb5488dca86

fbshipit-source-id: d2c6038978f80e5429632f8b49107090a8a247f4
2020-03-19 22:39:12 -07:00
0ccdec6b4c Revert e7fc55e (#35080)
Summary:
Resubmit D20464855 and also fix the broken test caused by D20464855.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35080

Differential Revision: D20551174

Pulled By: lly-zero-one

fbshipit-source-id: 5a0547a64365c556c3a677a9512423047497cc85
2020-03-19 22:32:32 -07:00
eb78f7ea41 torch.cat: disallow inputs on different devices (#35053)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35045
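
A sketch of the newly enforced behavior (requires a CUDA device):

```python
import torch

# After this change, torch.cat raises instead of accepting tensors that
# live on different devices.
if torch.cuda.is_available():
    a = torch.rand(2)          # CPU
    b = torch.rand(2).cuda()   # GPU
    try:
        torch.cat([a, b])
    except RuntimeError as e:
        print("raised as expected:", e)
```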
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35053

Differential Revision: D20545517

Pulled By: ngimel

fbshipit-source-id: eee3fc87c7e578ff44d69d5ce6f92a8f496fa97b
2020-03-19 22:06:39 -07:00
89110fbe6c Fix torch.mm export to ONNX (#34661)
Summary:
torch.mm is exported as the Gemm operator in ONNX, and both have an optional input: out.
out is considered broadcastable in Gemm, and during graph optimization the optional input (out) would get selected. Since out is optional, when it is not defined in torch.mm this results in the following exception:
IndexError: vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34661
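
For reference, a minimal sketch of an export that exercises this path (the module and shapes are illustrative):

```
# torch.mm lowers to the ONNX Gemm operator discussed above; no `out`
# argument is supplied, which is the case that used to hit the IndexError.
import io
import torch

class MatMul(torch.nn.Module):
    def forward(self, a, b):
        return torch.mm(a, b)

a, b = torch.randn(3, 4), torch.randn(4, 5)
buf = io.BytesIO()
torch.onnx.export(MatMul(), (a, b), buf)
```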

Reviewed By: hl475

Differential Revision: D20496398

Pulled By: houseroad

fbshipit-source-id: e677aef0a6aefb1f83a54033153aaabe5c23bc0f
2020-03-19 21:59:34 -07:00
e65ac7af14 Also vectorize complex types in fill. (#34973)
Summary:
Given that complex types have also been vectorized, there is no need to
handle complex types differently in fill.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34973

Differential Revision: D20551014

Pulled By: ezyang

fbshipit-source-id: e0cb519aa17f90b7a2d70700b32b80acb0d41b14
2020-03-19 21:22:04 -07:00
463f7920bd repr and _*state_dict for qRNN (#31540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31540

Fixes #31468

Test Plan: Imported from OSS

Differential Revision: D19205894

Pulled By: z-a-f

fbshipit-source-id: 80c36f74aa20a125ea8d74a54e9905576f1bc6d7
2020-03-19 20:49:50 -07:00
991b97277a [RELAND] Eager autocasting, out-of-place ops only (with MSVC 2017 fix) (#35011)
Summary:
https://github.com/pytorch/pytorch/pull/32140 was approved and merged, but [reverted](d0577e19f0) because it broke builds with versions of Visual Studio older than 15.8 that were not represented in public CI.  The build failures were caused by a [known VS bug](https://developercommunity.visualstudio.com/content/problem/27729/allow-function-with-internal-linkage-as-template-n.html), fixed in versions 15.8 and newer.

The present PR reverts the revert (restoring https://github.com/pytorch/pytorch/pull/32140 's diffs) and adds a workaround to enable compilation with VS < 15.8.  The workaround isn't pretty, but it's guarded by macros such that it's only used when compiling with VS < 15.8.  All other builds compile with the same code/control flow as was merged in https://github.com/pytorch/pytorch/pull/32140.

Original description of https://github.com/pytorch/pytorch/pull/32140:
> Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081

> In-place ops and ops with user-supplied out=... can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/issues/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests. Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35011
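
The description above doesn't show the user-facing entry point; as a rough sketch, assuming the context-manager API this work eventually shipped under (`torch.cuda.amp.autocast` is an assumption here), usage looks like:

```
# Rough sketch only: inside the autocast region, out-of-place ops such as
# the linear layer's matmul run in reduced precision automatically.
import torch

model = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

with torch.cuda.amp.autocast():
    y = model(x)
print(y.dtype)  # expected: torch.float16
```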

Differential Revision: D20541921

Pulled By: ezyang

fbshipit-source-id: abb5488dca8620b0daac4306ebf2bb47fc36e4f5
2020-03-19 20:18:18 -07:00
edb794fb19 [ROCm] Enable BFloat16 type for TopK operator on ROCm. (#34849)
Summary:
This PR enables bfloat16 for topk on ROCm.

iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34849

Differential Revision: D20544732

Pulled By: ezyang

fbshipit-source-id: 1ad017a4403d2a429d98e60c8eb1f78b320df920
2020-03-19 20:04:08 -07:00
bb63710c9a Reduce the number of iterations in test_autograd_context (#35037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35037

Closes #34960

We cannot reproduce the test failure on the dev server, a local machine, or
the CI environment that captured the failure. However, the failing test takes
very long (~10 sec) on macOS, so we reduce the number of iterations to
make it lighter.

Re-enabling the test; we will monitor whether the error occurs again.

Test Plan: Imported from OSS

Differential Revision: D20536272

Pulled By: mrshenli

fbshipit-source-id: 577822574e5f6271f1cbb14b56c68c644291713e
2020-03-19 19:48:54 -07:00
3c90a90730 Revert D20540599: Add TensorExpr Fuser tests.
Test Plan: revert-hammer

Differential Revision:
D20540599

Original commit changeset: ced9b6657fe7

fbshipit-source-id: e8fa11f20207c35f39b3fbe6f45fc627715377c1
2020-03-19 18:37:32 -07:00
e7fc55ef7b Revert D20464855: [pytorch][PR] Add the fusion of quantized batchnorm and relu
Test Plan: revert-hammer

Differential Revision:
D20464855

Original commit changeset: 57090d427053

fbshipit-source-id: e7c50b5e7cd27a479539d7ee17580118377971c5
2020-03-19 18:31:11 -07:00
a4afac6076 enforce rref JIT pickling to be in the scope of rpc calls (#34689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34689

RRef JIT pickling is only allowed inside RPC calls. This is enforced by adding a thread-local variable isInRpcCall, set to true when converting RPC requests or responses to messages, before calling JIT::pickle(). Inside JIT::pickle(), pickling an RRef is allowed only when isInRpcCall is true.
ghstack-source-id: 100481001

Test Plan: unit tests

Differential Revision: D20429826

fbshipit-source-id: dbc04612ed15de5d6c7d75a4732041ccd4ef3f8c
2020-03-19 18:07:39 -07:00
8210b2054e Move ivalue tests to aten (#34985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34985

IValue is part of the overall runtime system, not just the JIT. So it
should be tested in the ATen tests.

The real motivation though is so that I can use gtest directly, not the
hacked-up version the JIT uses.

Test Plan: Imported from OSS

Differential Revision: D20537902

Pulled By: suo

fbshipit-source-id: 09897e015ecde24aa8996babeaa08d98db90ef0d
2020-03-19 17:56:37 -07:00
5f32dfca16 Add equality comparison to c10::Dict (#34892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34892

Same rationale and implementation as in
https://github.com/pytorch/pytorch/pull/34856

Test Plan: Imported from OSS

Differential Revision: D20493169

Pulled By: suo

fbshipit-source-id: 46d79a4ff5d4af2964cfaeb2c43f56decadf3201
2020-03-19 17:56:32 -07:00
90045ce5e0 Add equality comparisons to c10::List (#34856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34856

This PR adds Python-like equality comparisons to our List type.
- `operator==` performs equality by value.
- `is` performs equality by identity.

The overall goal is that I want to define equality on `IValue` to avoid
people implementing their own broken versions. So, we should have
equality reasonably defined on all types that `IValue` can be.

smessmer raises the concern that C++ people expect `operator==` on
reference types to test identity. I think that's a reasonable concern,
but in practice, it seems that people are defining equality functions to
do it by value anyway, just poorly. My claim is that if we just tell
people that TorchScript types behave like Python types, it will not be
super confusing.
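
From the TorchScript side, the intended semantics are simply Python's. A minimal sketch:

```
# `==` on TorchScript lists compares element-wise by value, like Python lists.
import torch
from typing import List

@torch.jit.script
def lists_equal(a: List[int], b: List[int]) -> bool:
    return a == b  # value equality, not identity

print(lists_equal([1, 2, 3], [1, 2, 3]))  # True
print(lists_equal([1, 2, 3], [1, 2, 4]))  # False
```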

Test Plan: Imported from OSS

Differential Revision: D20483462

Pulled By: suo

fbshipit-source-id: ba2909daa6778924293ed6ef456ab9fc84215442
2020-03-19 17:55:16 -07:00
d3f5045bf5 PyTorch should always depend on future (#35057)
Summary:
Because `past` is used in `caffe2.python.core`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35057

Test Plan: CI

Differential Revision: D20547042

Pulled By: malfet

fbshipit-source-id: cad2123c7b88271fea37f21e616df551075383a8
2020-03-19 17:31:47 -07:00
33dcaaa872 [quant][onnx] Add aten::max_pool2d to jit pass (#34912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34912

max_pool2d quantized op actually shows up as aten::max_pool2d

Test Plan:
python test/test_pytorch_onnx_caffe2_quantized.py

Imported from OSS

Differential Revision: D20497780

fbshipit-source-id: 5524ae41676c2d6de1ae3544fe36ac24f2a77b19
2020-03-19 16:57:20 -07:00
fd57e0901e remove the slow path(NCHW) for avg_pool3d (#34994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34994

Use the fast path for NCHW input tensor

Test Plan: run the pool unit tests

Differential Revision: D20522082

fbshipit-source-id: 6e834425d06fbb1a105d851c2c36ef73df9de08f
2020-03-19 16:20:00 -07:00
d2d26bf643 [rpc] fix test_debug_info for python 3.5 (#34828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34828

Python 3.5 does not guarantee the ordering of dictionary keys; this was added
in Python 3.6+. Fixing this so the test is no longer flaky on 3.5. Verified by
500 stress runs with Python 3.5.
ghstack-source-id: 100426555

Test Plan: 500 stress tests in python 3.5

Differential Revision: D20474996

fbshipit-source-id: 89b614a32363d1e7f3f7a4f27bec4fd7d507721d
2020-03-19 16:12:58 -07:00
733b6315fd Add the fusion of quantized batchnorm and relu (#34795)
Summary:
As title, we want to support the BN2d_relu and BN3d_relu

Test to be added!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34795

Differential Revision: D20464855

Pulled By: lly-zero-one

fbshipit-source-id: 57090d427053c9c94c1b387b33740a7e61261a9d
2020-03-19 16:01:00 -07:00
851579d868 [Onnxifi] Blacklist ops in the partitions that are supposed to run on CPU (#34991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34991

The definition for the partition to be run on CPU is that it will contain an empty device_id list. We chose this over an op with no partitioning info because
1. Backward compatible with models that don't have partitioning info
2. Being explicit can flush out issues at an earlier stage.

Test Plan:
```
LD_LIBRARY_PATH=third-party-buck/platform007/build/fb-nnpi/lib ./sigrid/predictor/tests/scripts/launch_ads_test_predictor.sh -g --nnpi --force_models=175742819_0 --sigrid_force_model_dir=$HOME/models/ --smc_server_port=7447 --glow-num-devices=1 --glow_interpreter_memory=$((256<<20)) --caffe2_fbgemm_fake_fp16_clamp --glow_global_fp16 --glow_clip_fp16 --glow_global_fused_scale_offset_fp16 --fbgemm_deserialize_to_original_format --caffe2_dump_input_of_type=Onnxifi --caffe2_logging_print_tensor --caffe2_predictor_use_memonger=no --onnxifi_debug_mode=true --caffe2_dump_input_with_recordio --caffe2_predictor_onnxifi_max_batch_size=32 --caffe2_predictor_onnxifi_max_seq_size=9600  --glow_onnxifi_backend=Interpreter  --onnxifi_blacklist_ops=SparseLengthsSum,SparseLengthsWeightedSum --glow_dump_graph
```

Now it hits a new error.

Reviewed By: ipiszy

Differential Revision: D20503167

fbshipit-source-id: 5a609760130bd1131e299ce85b7824cbcbdf1f09
2020-03-19 15:10:49 -07:00
7b59f41009 Add TensorExpr Fuser tests. (#35052)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35052

Differential Revision: D20540599

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: ced9b6657fe72bca61833ab5d59bdaddcacd114b
2020-03-19 14:31:54 -07:00
02e16b38f3 Remove the use of two magic numbers in vec256 (#35003)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35003

Differential Revision: D20536130

Pulled By: albanD

fbshipit-source-id: 7e3c46eeebfffff045f53c60c7b510fad62d6a98
2020-03-19 14:20:21 -07:00
7065c46ea2 Respect dist autograd context in torch.jit._fork. (#34360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34360

The distributed autograd context sets up a thread local context id
which is used to perform appropriate book keeping and autograd recording of RPC
functions in the forward pass.

However, if we use torch.jit._fork within the distributed autograd context, the
code executed within torch.jit._fork will lose this context since it is run in
a separate JIT thread and the thread local is not set in that thread.

To fix this problem, we pass in the distributed autograd context to
torch.jit._fork similar to what we did in
https://github.com/pytorch/pytorch/pull/16101.
ghstack-source-id: 100445465

Test Plan: waitforbuildbot

Differential Revision: D20301352

fbshipit-source-id: aa3fffe69c2b40722c66213351a4e0d77484a621
2020-03-19 14:12:28 -07:00
b3fccda4a9 [DPER3][Shape inference] Update Shape Information in dper3 backend (#34475)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34475

Differential Revision: D20332799

fbshipit-source-id: 16aa7399eb48ce4d1d0f8431941ae1252322c382
2020-03-19 13:49:34 -07:00
c957580133 Add promotion pipeline for S3 and conda artifacts (#34993)
Summary:
Adds a new promotion pipeline for both our wheel packages hosted on S3
as well as our conda packages hosted on anaconda.

Promotion is only run on tags that match the following regex:

    /v[0-9]+(\.[0-9]+)*/

Example:

    v1.5.0

The promotion pipeline is also only run after a manual approval from
someone within the CircleCI security context "org-member"

> NOTE: This promotion pipeline does not cover promotion of packages that
>      are published to PyPI; this is an intentional choice, as those
>      packages cannot be reverted after they have been published.

TODO: Write a proper testing pipeline for this

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34993

Differential Revision: D20539497

Pulled By: seemethere

fbshipit-source-id: 104772d3c3898d77a24ef9bf25f7dbd2496613df
2020-03-19 13:36:51 -07:00
37b234a880 quantized hardsigmoid, take 2 (#34959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34959

Adds quantized implementation of hardsigmoid.

Original PR was https://github.com/pytorch/pytorch/pull/34607 and had to
be reverted for a test breakage, trying again.

Test Plan:
tests
benchmarks

Imported from OSS

Differential Revision: D20514212

fbshipit-source-id: cc7ae3b67757e2dde5c313c05ce60a0f2625d961
2020-03-19 13:27:22 -07:00
74009dc558 [profiler] use swap in allocBlock to reduce time the lock is held. (#34499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34499

RangeEventList::allocBlock currently iterates through `blocks`, access to which
is serialized, and accumulates them into `result`. Instead of doing this, we can
swap with an empty `forward_list` in constant time, then unlock, and use the
local list to populate `result`.
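
In Python terms, the pattern is roughly the following (illustration only; the real code is C++ over a std::forward_list):

```
# Swap the shared container for an empty one in O(1) while holding the lock,
# then do the expensive accumulation on the local copy after unlocking.
import threading

class RangeEventList:
    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = []

    def consolidate(self):
        with self._lock:
            blocks, self._blocks = self._blocks, []  # constant-time swap
        result = []  # built outside the critical section
        for block in blocks:
            result.extend(block)
        return result
```
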
ghstack-source-id: 100426115

Test Plan: existing profiler tests pass

Differential Revision: D20346423

fbshipit-source-id: 0e567b56049daa371051ccec6c5d1630a92db15f
2020-03-19 13:07:35 -07:00
e433271320 Install CUDA manually on Windows CI to avoid flakiness (#34940)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34821.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34940

Differential Revision: D20538056

Pulled By: ezyang

fbshipit-source-id: af1b2e9f9a80796763025d01eade3ab31b9d0cdb
2020-03-19 12:23:55 -07:00
53539567cb [Onnxifi] Copy partitioning info when lowering to glow
Summary: So that Glow can use this info to do actual function partitioning.

Reviewed By: jfix71

Differential Revision: D20502439

fbshipit-source-id: 0ade94b49b49172dc9370d1fc96454ade52ff269
2020-03-19 12:15:04 -07:00
226f559394 Updating submodules
Summary:
GitHub commits:

ced74147c3
33849b670b
63bf7655e4
d70eb504b7
442404558a
fbf509dcb5

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: a3eb6b95a915e85e88719ca5870e5c34f4dfed7f
2020-03-19 11:19:16 -07:00
7335f079ab [pt][quant] qmul and qadd should preserve input memory format (#34834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34834

They should keep the activations in channels-last format, i.e., the same as the input tensors to these operations.

### Before
```
 -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.06%            129.181us        0.06%            129.181us        129.181us        1
quantized::conv2d          21.74%           47.744ms         21.74%           47.744ms         408.067us        117
quantized::add_scalar      16.36%           35.930ms         16.36%           35.930ms         520.726us        69
quantized::relu6           0.69%            1.515ms          0.69%            1.515ms          21.959us         69
quantized::mul_scalar      6.08%            13.364ms         6.08%            13.364ms         193.676us        69
quantized::mul             53.17%           116.781ms        53.17%           116.781ms        1.269ms          92
adaptive_avg_pool2d        0.02%            42.700us         1.61%            3.527ms          146.948us        24
_adaptive_avg_pool2d       1.59%            3.484ms          1.59%            3.484ms          145.169us        24
sigmoid                    0.08%            173.702us        0.08%            173.702us        7.552us          23
quantized::add             0.20%            445.648us        0.20%            445.648us        27.853us         16
dropout                    0.00%            2.598us          0.00%            2.598us          2.598us          1
view                       0.00%            10.311us         0.00%            10.311us         10.311us         1
dequantize                 0.00%            4.645us          0.00%            4.645us          4.645us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 219.627ms
```

### After
```
  -------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.12%            155.807us        0.12%            155.807us        155.807us        1
quantized::conv2d          25.50%           31.981ms         25.50%           31.981ms         273.343us        117
quantized::add_scalar      44.53%           55.840ms         44.53%           55.840ms         809.281us        69
quantized::relu6           1.25%            1.570ms          1.25%            1.570ms          22.749us         69
quantized::mul_scalar      10.73%           13.449ms         10.73%           13.449ms         194.914us        69
quantized::mul             16.67%           20.904ms         16.67%           20.904ms         227.220us        92
adaptive_avg_pool2d        0.03%            41.713us         0.69%            862.922us        35.955us         24
_adaptive_avg_pool2d       0.65%            821.209us        0.65%            821.209us        34.217us         24
sigmoid                    0.15%            182.344us        0.15%            182.344us        7.928us          23
quantized::add             0.34%            431.939us        0.34%            431.939us        26.996us         16
dropout                    0.00%            1.936us          0.00%            1.936us          1.936us          1
view                       0.01%            10.281us         0.01%            10.281us         10.281us         1
dequantize                 0.00%            4.562us          0.00%            4.562us          4.562us          1
-------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 125.394ms

```
ghstack-source-id: 100305788

Test Plan: buck test //caffe2/test:quantized

Differential Revision: D20473713

fbshipit-source-id: c878fbb8f5a1a33f0cdac2657cc61e97ceb6c183
2020-03-19 10:12:52 -07:00
6d488714a7 .circleci: Specify setup job to run on everything (#35013)
Summary:
By default, CircleCI runs 0 jobs on tags, meaning that when we
tag a build, no job is run unless the dependent jobs contain the
correct filters.

This adds an explicit configuration to run the setup job on every branch
and every tag that CircleCI can run on.

For more information on CircleCI filters and what they do (and more
importantly what they do not do) visit:

https://circleci.com/docs/2.0/configuration-reference/#filters-1

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35013

Differential Revision: D20535560

Pulled By: seemethere

fbshipit-source-id: 7ee5dddbc0a9416fd76ed198e5447318c53e1873
2020-03-19 09:36:27 -07:00
35d9874a35 in test_data_parallel.py, remove skipIfRocm from tests that pass (#34978)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34978

Differential Revision: D20535920

Pulled By: mrshenli

fbshipit-source-id: 3baa8608dd3b0dd5578bc32e56a2e6c1fe69492d
2020-03-19 09:16:43 -07:00
1f4a4aaf64 functional autograd api (#34066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34066

Basic implementation of https://github.com/pytorch/pytorch/issues/30632
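
The message is terse; as a rough sketch of the functional-style API proposed in the linked issue (the `torch.autograd.functional.jacobian` entry point is an assumption here):

```
# Rough sketch: compute the Jacobian of a plain Python function without
# manually wiring up requires_grad.
import torch
from torch.autograd.functional import jacobian

def f(x):
    return x ** 2

print(jacobian(f, torch.tensor([1.0, 2.0])))
# expected: tensor([[2., 0.],
#                   [0., 4.]])
```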

Test Plan: Imported from OSS

Differential Revision: D20260307

Pulled By: albanD

fbshipit-source-id: 7db5c2411ddc3e954ff8fbbe93eb3b96a2bcfb2f
2020-03-19 08:24:07 -07:00
96860af870 Revert D20164420: [1.5 Release][Dist Autograd][Better Engineering] Notify Workers on Failure during Distributed Autograd
Test Plan: revert-hammer

Differential Revision:
D20164420

Original commit changeset: 3d4ed7423096

fbshipit-source-id: 67f0f9c11cee84df6dbe37db7821dd601227df66
2020-03-19 08:02:07 -07:00
7c06b86e42 Revert D20518647: [pytorch][PR] [C++ API Parity] [Optimizers] Merged Optimizer and LossClosureOptimizer
Test Plan: revert-hammer

Differential Revision:
D20518647

Original commit changeset: 4760d1d29df1

fbshipit-source-id: b84f1a06c2de27e147716279223a6844ef89f760
2020-03-19 07:53:43 -07:00
5d92a6cc30 Revert D7778113: Reland "[RPC] Use qualified name str directly in RPC torch script code path"
Test Plan: revert-hammer

Differential Revision:
D7778113

Original commit changeset: b830c03ac946

fbshipit-source-id: ef08b287a6db58320c738cde0c99b3333f5724eb
2020-03-19 06:05:23 -07:00
9c4683e8e3 Revert D20312366: [pytorch][PR] Added type promotion logic for complex numbers
Test Plan: revert-hammer

Differential Revision:
D20312366

Original commit changeset: 90f00a1a916d

fbshipit-source-id: 4510739a888b2eec5d8a72e792998ac46da6d82a
2020-03-19 05:55:57 -07:00
0d8447a9b8 Warns when performing integer division with div and addcdiv (#34570)
Summary:
Per title.

In the future we want to make div(), the division operator, and addcdiv perform true division as in Python 3, NumPy, and JAX. To do this without silently breaking users we plan to:

- Warn (once) in 1.5 when a user performs integer division using div or addcdiv
- RuntimeError in 1.6 when a user attempts to perform integer division using div or addcdiv
- Always perform true division in 1.7 using div, /, and addcdiv

Users can use true_divide or floor_divide today to explicitly specify the type of division they like.
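
A minimal sketch of those explicit alternatives:

```
import torch

a = torch.tensor([5, 7])
b = torch.tensor([2, 2])

print(torch.true_divide(a, b))   # tensor([2.5000, 3.5000]) -- true division
print(torch.floor_divide(a, b))  # tensor([2, 3]) -- floored integer division
torch.div(a, b)                  # still integer division, but now warns once
```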

A test for this behavior is added to test_type_promotion. Unfortunately, because we are only warning once (to avoid a deluge), the test only uses maybeWarnsRegex.

The XLA failure is real but will be solved by https://github.com/pytorch/pytorch/pull/34552. I'll be sure to land that PR first to avoid temporarily breaking the XLA build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34570

Differential Revision: D20529211

Pulled By: mruberry

fbshipit-source-id: 65af5a9641c5825175d029e8413c9e1730c661d0
2020-03-19 04:10:55 -07:00
6f737dd4a3 Fix signed-unsigned warnings (#34791)
Summary:
And fixes a few typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34791

Test Plan: CI

Differential Revision: D20524879

Pulled By: malfet

fbshipit-source-id: 58fa03bd6356979e77cd1bffb6370d41a177c409
2020-03-19 00:29:56 -07:00
c8f665dcb6 Added type promotion logic for complex numbers (#34093)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/33780
After this PR:
1. dtype promotion logic will correctly work for ops involving complex scalars
2. torch.ComplexFloatTensor, torch.ComplexDoubleTensor works
3. added alias for complex64 (cfloat) and complex128 (cdouble)
4. added an internal function get_complex_default_dtype (consciously not exposed in public API)

```
>>> 1j*torch.ones(2)
tensor([(0.0000 + 1.0000j), (0.0000 + 1.0000j)], dtype=torch.complex64)

>>> torch.set_default_dtype(torch.float64)
>>> 1j*torch.ones(2)
tensor([(0.0000 + 1.0000j), (0.0000 + 1.0000j)], dtype=torch.complex128)

>>> 1j + torch.ones(2)
tensor([(1.0000 + 1.0000j), (1.0000 + 1.0000j)], dtype=torch.complex128)

>>> torch.tensor(1j) + torch.ones(2,2)
tensor([[(1.0000 + 1.0000j), (1.0000 + 1.0000j)],
        [(1.0000 + 1.0000j), (1.0000 + 1.0000j)]], dtype=torch.complex128)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34093

Differential Revision: D20312366

Pulled By: anjali411

fbshipit-source-id: 90f00a1a916d9c8eeda101eb6e9d250fce569815
2020-03-18 23:36:13 -07:00
d616cad676 Reland "[RPC] Use qualified name str directly in RPC torch script code path" (#34962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34962

Relanding #34733. Fix is in https://github.com/pytorch/pytorch/pull/34988.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

```
buck test mode/dev //caffe2/test/distributed/rpc/jit:rpc_fork_thrift -- test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D7778113

fbshipit-source-id: b830c03ac9463075fca248eba75be364b0e8b080
2020-03-18 22:25:09 -07:00
be82e554fe Revert D20524479: [pytorch][PR] [C++ API Parity] Add xor_convergence test for lbfgs
Test Plan: revert-hammer

Differential Revision:
D20524479

Original commit changeset: 3413779676ab

fbshipit-source-id: ef8007ed6c184bc8b8751eb713aac2a891260048
2020-03-18 21:56:17 -07:00
153b16ef4c Doxygen for torchbind (#35007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35007

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D20525680

Pulled By: jamesr66a

fbshipit-source-id: aaa768f395e30dcec8007d50e17f21837c306719
2020-03-18 21:49:24 -07:00
eef17edaa3 Fix warnings in test/test_jit_fuser.py (#34980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34980

We were passing sample inputs to `torch.jit.script` (as if it were
`torch.jit.trace`), but that argument was treated as the optional
`optimize` parameter. That parameter is deprecated, which caused a
warning.

Differential Revision: D20520369

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 87b40a5e35bfc4a3d7a5d95494632bfe117e40b7
2020-03-18 19:55:25 -07:00
55b254e114 update gitignore to include clangd index (#35018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35018

Test Plan: Imported from OSS

Differential Revision: D20528402

Pulled By: suo

fbshipit-source-id: badb487a4fbb0299b49c1b1022bcd7b61eba1e88
2020-03-18 19:53:03 -07:00
d3b6099366 [build] Update gloo submodule (#34969)
Summary:
Update gloo submodule to `113bde13035594cafdca247be953610b53026553` to be compatible with separate compilation introduced by
https://github.com/facebookincubator/gloo/pull/251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34969

Test Plan: CI

Differential Revision: D20527163

Pulled By: malfet

fbshipit-source-id: 300d83d8fe95d57b8d740543efada3c56ac7b493
2020-03-18 19:24:23 -07:00
5f67c923f1 [1.5 Release][Dist Autograd][Better Engineering] Notify Workers on Failure during Distributed Autograd (#34638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34638

Fixes: https://github.com/pytorch/pytorch/issues/27643

This PR manages notifying workers in the event of a failure during distributed autograd. It gracefully handles propagating errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.

(Note: this ignores all push blocking failures!)

Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.

Differential Revision: D20164420

fbshipit-source-id: 3d4ed74230969ac70bb763f1b5b1c16d979f66a2
2020-03-18 18:56:14 -07:00
a73dfcf8cf Adjust ProtoBufPatch to protobuf-3.11.x (#35008)
Summary:
The `GetEmptyStringAlreadyInited` invocation pattern in protobuf-generated header files changed to
`:PROTOBUF_NAMESPACE_ID::internal::GetEmptyStringAlreadyInited`, where `PROTOBUF_NAMESPACE_ID` is defined in `protobuf/port_def.inc` as `google::protobuf`.

This likely changed around the protobuf-3.8.x timeframe, but I've only tested it with protobuf-3.11.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35008

Test Plan: Update `third-party/protobuf` submodule to 3.11.4, compile and run `pattern_net_transform_test`

Differential Revision: D20526949

Pulled By: malfet

fbshipit-source-id: fddaa3622c48ad883612c73c40a20d306d88d66b
2020-03-18 18:35:23 -07:00
e5ee95e448 [RPC] Add to confirmed users immediately if the fork is shared from owner, instead of adding nothing to pending users (#34988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34988

In https://github.com/pytorch/pytorch/pull/31893, we introduced a confirmedUsers_ map in RRefContext.

For the case where the fork is shared from the owner, there is no `pendingUsers_` intermediate phase for this fork, so we should put it into `confirmedUsers_` immediately.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork
```

Differential Revision: D7735909

fbshipit-source-id: 14c36a16486f0cc9618dcfb111fe5223781b647d
2020-03-18 18:17:41 -07:00
b8e043abca [C++ API Parity] [Optimizers] Merged Optimizer and LossClosureOptimizer (#34957)
Summary:
1. Removed LossClosureOptimizer, and merged Optimizer into OptimizerBase (and renamed the merged class to Optimizer)
2. Merged the LBFGS-specific serialize test function and the generic test_serialize_optimizer function.
3. BC-compatibility serialization test for LBFGS
4. Removed mentions of parameters_ in optimizer.cpp, de-virtualize all functions
5. Made defaults_ optional argument in all optimizers except SGD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34957

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20518647

Pulled By: anjali411

fbshipit-source-id: 4760d1d29df1784e2d01e2a476d2a08e9df4ea1c
2020-03-18 17:28:57 -07:00
2a1c83823d [tools] Parallelize tools/clang_format_new.py (#34750)
Summary:
**Summary**
This commit parallelizes the invocation of `clang-format` on all files
in `tools/clang_format_new.py` using `asyncio`.
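
A minimal sketch of the approach (not the actual script) for fanning out the formatter over many files with asyncio subprocesses:

```
import asyncio

async def run_clang_format(path: str) -> int:
    # clang-format -i rewrites the file in place
    proc = await asyncio.create_subprocess_exec("clang-format", "-i", path)
    return await proc.wait()

async def main(paths):
    # launch one subprocess per file and await them concurrently
    results = await asyncio.gather(*(run_clang_format(p) for p in paths))
    if any(results):
        print("Some files not formatted correctly")

asyncio.run(main(["a.cpp", "b.cpp"]))
```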

**Testing**
Ran and timed the script.

*Before*
```
$ time ./tools/clang_format_new.py  --diff
...
real	0m7.615s
user	0m6.012s
sys	0m1.634s
```

*After*
```
$ time ./tools/clang_format_new.py  --diff
...
Some files not formatted correctly

real	0m2.156s
user	0m8.488s
sys	0m3.201s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34750

Differential Revision: D20523133

Pulled By: SplitInfinity

fbshipit-source-id: 509741a0b4fcfcdcd7c5a45654e3453b4874d256
2020-03-18 17:27:02 -07:00
6e47e7bf52 [pytorch][mobile] fixed AutoGradMode/AutoNonVariableTypeMode uses for mobile callsites
Summary:
There are three guards related to mobile build:
* AutoGradMode
* AutoNonVariableTypeMode
* GraphOptimizerEnabledGuard

Today we need to set some of these guards before calling libtorch APIs because we customized the mobile build to only support inference (for both OSS and most FB use cases) to optimize binary size.

Several changes were made since the 1.3 release, so there are already inconsistent uses of these guards in the codebase. I did a sweep of all mobile-related model loading & forward() call sites, trying to unify the use of these guards:

Full JIT: still set all three guards. More specifically:
* OSS: Fixed a bug of not setting the guard at model load time correctly in Android JNI.
* FB: Not covered by this diff (as we are using mobile interpreter for most internal builds).

Lite JIT (mobile interpreter): only needs AutoNonVariableTypeMode guard. AutoGradMode doesn't seem to be relevant (so removed from a few places) and GraphOptimizerEnabledGuard definitely not relevant (only full JIT has graph optimizer). More specifically:
* OSS: At this point we are not committed to support Lite-JIT. For Android it shares the same code with FB JNI callsites.
* FB:
** JNI callsites: Use the unified LiteJITCallGuard.
** For iOS/C++: manually set AutoNonVariableTypeMode for _load_for_mobile() & forward() callsites.

Ideally we should avoid having to set AutoNonVariableTypeMode for the mobile interpreter. It's currently needed for dynamic dispatch + inference-only mobile builds (where variable kernels are not registered) - without the guard it will try to run `variable_fallback_kernel` and crash (PR #34038). The proper fix will take some time, so we are using this workaround to unblock the selective BUCK build, which depends on dynamic dispatch.

PS. The current status (of having to set AutoNonVariableTypeMode) should not block running an FL model + mobile interpreter - if all necessary variable kernels are registered, then it can call _load_for_mobile()/forward() against the FL model without setting the AutoNonVariableTypeMode guard. It's still inconvenient for Java callsites, as it's set unconditionally inside JNI methods.

Test Plan: - CI

Reviewed By: xta0

Differential Revision: D20498017

fbshipit-source-id: ba6740f66839a61790873df46e8e66e4e141c728
2020-03-18 17:19:35 -07:00
a4048b4703 port ge changes from bert/pytorch_fusion (#34942)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34942

Differential Revision: D20505894

Pulled By: Krovatkin

fbshipit-source-id: 7b442fae6aa2b1a29891b94f824094a1fddae4a2
2020-03-18 17:13:24 -07:00
4521477f83 [C++ API Parity] Add xor_convergence test for lbfgs (#35001)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35001

Differential Revision: D20524479

Pulled By: anjali411

fbshipit-source-id: 3413779676ab95c1ee82298f95d3441a89873107
2020-03-18 17:06:53 -07:00
bcbde490e4 Fix flake (#34974)
Summary:
fix flake, add overload names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34974

Differential Revision: D20519191

Pulled By: eellison

fbshipit-source-id: d08d36b64397287cad484690074e694d8a0e472e
2020-03-18 16:45:33 -07:00
b2e5e0cad6 [quant][graphmode] quantization support for aten::reshape (#34803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34803

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20504457

fbshipit-source-id: 5ca691ef4880c72d30d62390e63e3288b2f06dce
2020-03-18 15:40:43 -07:00
69e701fbf9 Add transfer_learning_blob_name_mappings into layer_model_helper to support layer model transfer learning
Summary: Add transfer_learning_blob_name_mappings into layer_model_helper to support layer model transfer learning

Reviewed By: mraway

Differential Revision: D20286298

fbshipit-source-id: de3e029611d843f38d3f42ecd4148358f7e14a2b
2020-03-18 15:28:00 -07:00
e35dd4f603 [jit] Include call stack in OSError message (#34669)
Summary:
Previously there was no indication of why you would get an `OSError` for something (such as the generated methods of a `dataclass`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34669

Pulled By: driazati

Differential Revision: D20426570

fbshipit-source-id: 45d63631984fa26a87c03de5523fb10d8abbc6db
2020-03-18 15:10:23 -07:00
3b7e1cd2cc Makes floor_divide a method, adds sparse floor division (#34552)
Summary:
(Updated per review feedback)

`torch.floor_divide` is currently a function that can operate on two tensors or a tensor and a scalar (scalar x scalar floor division is handled natively by Python and the JIT has a builtin function for it). This PR updates it to:

- have an out variant: `floor_divide(x, y, out=z)`
- be a method on a tensor: `x.floor_divide(y)`
- have an in-place variant: `x.floor_divide_(y)`
- work with sparse tensors

Tests are added to test_sparse.py and test_torch.py for these new behaviors.
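
A minimal sketch of the new call patterns:

```
import torch

x = torch.tensor([5, 7])
y = torch.tensor([2, 2])
z = torch.empty(2, dtype=torch.int64)

torch.floor_divide(x, y, out=z)  # out variant
x.floor_divide(y)                # method
x.floor_divide_(y)               # in-place variant
```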

In addition, this PR:

- cleans up the existing sparse division and true_division code and improves their error message
- adds testing of sparse true_division to test_sparse.py
- extends existing floor_divide testing in test_torch to run on CUDA, too, not just the CPU

Unfortunately, making floor_divide a method requires breaking backwards compatibility, and floor_divide has been added to the BC whitelist since this is intentional. The BC issue is that the first parameter name to torch.floor_divide is changing from input to self. If you previously called torch.floor_divide with keyword arguments, e.g. torch.floor_divide(input=x, other=y), you will need to update to torch.floor_divide(self=x, other=y), or the more common torch.floor_divide(x, y).

The intent of this PR is to allow floor_divide to be substituted for division (torch.div, /) wherever division was previously used. In 1.6 we expect torch.div to perform true_division, and floor_divide is how users can continue to perform integer division with tensors.

There are two potential follow-up issues suggested by this PR:

- the test framework might benefit from additional tensor construction classes, like one to create dividends and divisors for multiple dtypes
- the test framework might benefit from a universal function test class. While methods have reasonable coverage as part of test_torch.py's TestTensorOp tests, function coverage is spotty. Universal functions are similar enough that it should be possible to generate tests for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34552

Differential Revision: D20509850

Pulled By: mruberry

fbshipit-source-id: 2cd3c828aad67191c77f2ed8470411e246f604f8
2020-03-18 15:00:53 -07:00
d77d907f0e [quant][graphmode] Add quantization support for aten::dropout (#34347)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34347

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20504453

fbshipit-source-id: 1bab29e21d0564ed88cdeb4894addfe00ebbd390
2020-03-18 14:35:27 -07:00
c747f09846 Add operator [] to c10::impl::ListIterator (#34926)
Summary:
This is causing failures on my Windows build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34926

Differential Revision: D20501850

Pulled By: smessmer

fbshipit-source-id: 92c72dd657b27b1786952dbdccfceff99f4ba743
2020-03-18 12:57:38 -07:00
064f6285af Torchvision in jenkins testing (#34909)
Summary:
This pull request updates the Torchvision commit to use ROCm-enabled torchvision in `.jenkins/pytorch/test.sh`.
Pytorch tests:
```
test_SyncBatchNorm_process_group (__main__.TestDistBackend)
test_alexnet (jit.test_models.TestModels)
test_script_module_script_resnet (jit.test_models.TestModels)
test_script_module_trace_resnet18 (jit.test_models.TestModels)
test_torchvision_smoke (__main__.TestTensorBoardPytorchGraph)
```
in `test2` were skipped because torchvision was not installed in `test2`; instead it was installed in `test1`. This PR moves the torchvision test to the correct place, thereby enabling the above-mentioned tests.

cc: ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34909

Differential Revision: D20515333

Pulled By: ezyang

fbshipit-source-id: 69439756a687ba441c1f8107233b4dbc1e108387
2020-03-18 12:45:51 -07:00
1afc584188 Deprecates current torch.full integral type inference, adds torch.full complex type inference (#34709)
Summary:
Per title.

Currently torch.full will always (attempt to) produce a float tensor. This is inconsistent with NumPy in (at least) two cases:

- When integral fill values (including bool) are given
- When complex fill values are given

For example:

```
np.full((1, 2), 1).dtype
: dtype('int64')

np.full((1, 2), (1 + 1j)).dtype
: dtype('complex128')
```

Whereas in PyTorch

```
torch.full((1, 2), 1).dtype
: torch.float32

torch.full((1, 2), (1 + 1j)).dtype
: RuntimeError: value cannot be converted to type float without overflow: (1,1)
```

This PR begins the process of deprecating our current behavior of returning float tensors (by default) when given integer fill values by warning the user that integer fill values will require explicitly specifying the dtype or out kwargs in 1.6, and in 1.7 the behavior will change to return a LongTensor by default (BoolTensor for bool values). The intermediate 1.6 release is to prevent changing the behavior silently and unexpectedly.

The PR also implements inference for complex types. So that with it:

```
torch.full((1, 2), (1 + 1j)).dtype
: torch.complex64
```

The complex type inference returns a ComplexFloat tensor when given a complex fill value (and no dtype or out kwarg is specified), unless the default dtype is Double, in which case a ComplexDouble tensor is returned.

A test for these behaviors is added to test_torch.py.

Implementation note:

This PR required customizing full's dispatch because currently in eager codegen the TensorOptions object passed to functions improperly sets has_dtype() to true, even if the user did not explicitly provide a dtype. torch.arange already worked around this issue with its own custom implementation. The JIT, however, does pass a properly constructed TensorOptions object.

Future Work:

This PR does not extend torch.full's complex type inference to ONNX. This seems unlikely to come up and will be a clear error if it does. When integer type inference is added to torch.full, however, then porting the behavior to ONNX may be warranted. torch.arange ported its complex type promotion logic to ONNX, for example.

Additionally, this PR mostly leaves existing call sites in PyTorch that would trigger this warning intact. This is to be more minimal (since the PR is BC breaking). I will submit a separate PR fixing PyTorch's call sites.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34709

Differential Revision: D20509387

Pulled By: mruberry

fbshipit-source-id: 129593ba06a1662032bbbf8056975eaa59baf933
2020-03-18 12:19:31 -07:00
f3b8a470e1 Added functionality for all to take Lists as input (#34582)
Summary:
New pull request after rebase error in pull request https://github.com/pytorch/pytorch/issues/33923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34582
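
A minimal sketch of the new capability inside TorchScript (assuming `List[bool]` is among the supported element types):

```
import torch
from typing import List

@torch.jit.script
def all_true(xs: List[bool]) -> bool:
    return all(xs)  # builtin all() now accepts a list in TorchScript

print(all_true([True, True]))   # True
print(all_true([True, False]))  # False
```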

Differential Revision: D20447689

Pulled By: eellison

fbshipit-source-id: 4296b64185eccb136b1b614b532deb3af20c7544
2020-03-18 12:01:30 -07:00
d0577e19f0 Revert D20346700: [pytorch][PR] Eager autocasting, out-of-place ops only
Test Plan: revert-hammer

Differential Revision:
D20346700

Original commit changeset: 12d77b391731

fbshipit-source-id: 108d72bf24232f443c0be293ec932c0c478d6a60
2020-03-18 11:42:51 -07:00
b35e544772 Minor fixes for RPC API doc (#34955)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34955

Test Plan: Imported from OSS

Differential Revision: D20512262

Pulled By: mrshenli

fbshipit-source-id: 86ed099638fd32dc8fbde5a6f284239b146fd5e9
2020-03-18 11:20:32 -07:00
d29f450e63 Revert D20442573: [RPC] Use qualified name str directly in RPC torch script code path
Test Plan: revert-hammer

Differential Revision:
D20442573

Original commit changeset: 87f8b7d94adc

fbshipit-source-id: db0f10c28352d2b3ca21b5357e8e09c01a50018c
2020-03-18 11:00:09 -07:00
689598df0b [quant][graphmode] insert quant/dequant work for duplicated debugName (#34315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34315

previously we registered quantization parameter attributes using the debugName of
the observed value, but debugName is not unique. This PR addresses the problem
by making attribute names unique.

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20504455

fbshipit-source-id: 6dd83bdfc4e4dc77ad3af3d5b48750fb01b2fce1
2020-03-18 10:49:25 -07:00
aaa8f02156 Eager autocasting, out-of-place ops only (#32140)
Summary:
Initial integration of eager autocasting, supporting out-of-place ops only for easier review.
Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081

In-place ops and ops with user-supplied `out=...` can certainly be supported as well (my initial WIP https://github.com/pytorch/pytorch/pull/29552 handled many) but require substantially more complex special casing in the autocasting backend and tests.  Support for these ops (much of which has already been written) will be broken into later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32140

Differential Revision: D20346700

Pulled By: ezyang

fbshipit-source-id: 12d77b3917310186fbddf11c59b2794dc859131f
2020-03-18 10:28:21 -07:00
fa5bc9fa2e Fix problem in NHWC max_pool2d; use accumulate type in NHWC max_pool2d (#34934)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/34736. Both code snippets in that issue now execute normally. More tests are also added.

This PR is a follow-up on https://github.com/pytorch/pytorch/issues/34519, where one variable was mistakenly missed when updating the max_pool2d kernel.

This PR also uses the accumulate type of scalar_t in the backward kernel, which resolves a numerical precision issue when stride < kernel_size on fp16.
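
A minimal sketch of the channels-last fp16 configuration this exercises (shapes and the stride < kernel_size choice are illustrative):

```
import torch
import torch.nn.functional as F

x = (torch.randn(1, 3, 8, 8, device="cuda", dtype=torch.float16)
         .to(memory_format=torch.channels_last)
         .requires_grad_())
y = F.max_pool2d(x, kernel_size=3, stride=2)  # stride < kernel_size
y.sum().backward()  # backward now accumulates in a wider type
```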

cc csarofeen ptrblck jjsjann123 VitalyFedyunin ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34934

Differential Revision: D20512062

Pulled By: VitalyFedyunin

fbshipit-source-id: a461ebbb3e3684aa183ae40e38d8f55bb6f4fee1
2020-03-18 08:32:10 -07:00
d927d58c2a Revert D20289209: Support RowWiseSparseAdam on GPU
Test Plan: revert-hammer

Differential Revision:
D20289209

Original commit changeset: a7a8a21bd18c

fbshipit-source-id: 4a8ae684d099a5499c28b7e65578fc7ab10b248d
2020-03-18 07:35:07 -07:00
a1eaaea288 Revert D20497453: [pytorch][PR] Makes floor_divide a method, adds sparse floor division
Test Plan: revert-hammer

Differential Revision:
D20497453

Original commit changeset: ac326f2007d8

fbshipit-source-id: b94b89b1a25521506e3d0a6b072d3d4d8c55e63d
2020-03-18 01:48:50 -07:00
a3de359464 Do not throw from CUDAContext destructor (#34756)
Summary:
Throwing from a destructor leads to undefined behaviour (most often a segfault),
so it's better to leak memory than to segfault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34756

Test Plan: Run `test_pytorch_onnx_caffe2`

Differential Revision: D20504228

Pulled By: malfet

fbshipit-source-id: 7a05776fea9036f602e95b8182f8493cb5886dab
2020-03-18 00:13:18 -07:00
b7129050e7 Makes floor_divide a method, adds sparse floor division (#34552)
Summary:
(Updated per review feedback)

`torch.floor_divide` is currently a function that can operate on two tensors or a tensor and a scalar (scalar x scalar floor division is handled natively by Python and the JIT has a builtin function for it). This PR updates it to:

- have an out variant: `floor_divide(x, y, out=z)`
- be a method on a tensor: `x.floor_divide(y)`
- have an in-place variant: `x.floor_divide_(y)`
- work with sparse tensors

Tests are added to test_sparse.py and test_torch.py for these new behaviors.

In addition, this PR:

- cleans up the existing sparse division and true_division code and improves their error message
- adds testing of sparse true_division to test_sparse.py
- extends existing floor_divide testing in test_torch to run on CUDA, too, not just the CPU

Unfortunately, making floor_divide a method requires breaking backwards compatibility, and floor_divide has been added to the BC whitelist since this is intentional. The BC issue is that the first parameter name to torch.floor_divide is changing from input to self. If you previously called torch.floor_divide with keyword arguments, e.g. torch.floor_divide(input=x, other=y), you will need to update to torch.floor_divide(self=x, other=y), or the more common torch.floor_divide(x, y).

The intent of this PR is to allow floor_divide to be substituted for division (torch.div, /) wherever division was previously used. In 1.6 we expect torch.div to perform true_division, and floor_divide is how users can continue to perform integer division with tensors.

There are two potential follow-up issues suggested by this PR:

- the test framework might benefit from additional tensor construction classes, like one to create dividends and divisors for multiple dtypes
- the test framework might benefit from a universal function test class. While methods have reasonable coverage as part of test_torch.py's TestTensorOp tests, function coverage is spotty. Universal functions are similar enough that it should be possible to generate tests for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34552

Differential Revision: D20497453

Pulled By: mruberry

fbshipit-source-id: ac326f2007d8894f730d1278fef84d63bcb07b5d
2020-03-18 00:01:45 -07:00
bcbdba450c [caffe2] open source 2/4-bit SLS operators (#34903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34903

Reattempt of D20461609

Moving the 2/4-bit SLS and row-wise 2/4-bit conversion operators to open source, to be used by DLRM.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20495304

fbshipit-source-id: 66a99677583f50fd40e29c514710c7b1a8cdbc29
2020-03-17 22:55:10 -07:00
d7e4a379a0 [C++ API Parity] LBFGS optimizer step() update and added closure to the Optimizer step() function (#34564)
Summary:
Follow-ups after this PR:

* Remove `LossClosureOptimizer`, and merge `Optimizer` into `OptimizerBase` (and rename the merged class to Optimizer)
* Merge the LBFGS-specific serialize test function and the generic `test_serialize_optimizer` function, possibly by passing a bool `has_only_global_state` flag into the `test_serialize_optimizer` function to denote whether `size()` should be equal to 1 or 2?
    * https://github.com/pytorch/pytorch/pull/34564#discussion_r393780303
* It seems that we don't have the equivalent `XORConvergence_LBFGS` test like the other optimizers, and it would be good to add one
* Remove mentions of `parameters_` in optimizer.cpp, de-virtualize all functions, and remove the `OptimizerBase(std::vector<Tensor> parameters)` constructor from `OptimizerBase`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34564

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20495701

Pulled By: anjali411

fbshipit-source-id: 6d35286d2decb6f7dff93d9d3e57515770666622
2020-03-17 22:27:24 -07:00
df20f5b374 Updating submodules
Summary:
GitHub commits:

70331595ce
51ae830b00

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 045a70a24059fc1120d54d5b85ffe0e2831d2161
2020-03-17 21:34:16 -07:00
130e720784 [torchbind] Add more comprehensive docscrings (#34906)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34906

Test Plan: Imported from OSS

Differential Revision: D20496221

Pulled By: jamesr66a

fbshipit-source-id: 3863ec77324564f6f0f1c54b0cbd6c29d12f3c74
2020-03-17 20:41:18 -07:00
09a7788a2f [torchbind] Improve IValue custom class API and remove most Capsule stuff (#34848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34848

Test Plan: Imported from OSS

Differential Revision: D20480514

Pulled By: jamesr66a

fbshipit-source-id: 1c595faf34e00aab0a6202a8902426bd310551c3
2020-03-17 20:39:34 -07:00
c4fdba326d Support using self as the destination in rpc.remote for builtin operators (#34931)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34931

Test Plan: Imported from OSS

Differential Revision: D20503571

Pulled By: mrshenli

fbshipit-source-id: ed1454a349798b18b9953bbf13c86bc43d3b559d
2020-03-17 20:30:19 -07:00
b5edf329f8 [JIT] Make RPC RRef Owner WorkerInfo.name available to TorchScript (#34896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34896

Make TorchScript support calling ref.owner() to get the owner's worker id and ref.owner_name() to get the owner's worker name.

Differential Revision: D7652208

fbshipit-source-id: a60125bb316ac2cf19a993cbd2affc933c0af7c9
2020-03-17 20:28:18 -07:00
95f1cb34b9 Revert D20480546: adds quantized implementation of hard sigmoid
Test Plan: revert-hammer

Differential Revision:
D20480546

Original commit changeset: 9febcb44afd9

fbshipit-source-id: 4461b455e63448cf45237e23c988b492c3e0f1b0
2020-03-17 19:58:08 -07:00
ff3d205ee5 [rpc] handle exceptions in ProcessGroupAgent::enqueueRecv (#34413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34413

In this diff we have made various improvements to ProcessGroupAgent in order to accommodate edge and error cases such as a "non-clean" shutdown (shutdowns in which we abort RPC as quickly as possible, and don't wait for all pending work across all RPC agents to be completed):

1. Catch and log exceptions in `enqueueRecv`. This prevents us from calling `std::terminate()` in a different thread and logs an error message indicating the issue. With this we no longer have crashes caused by exceptions in this thread during non-graceful shutdown.

2. Provide cleaner error messages everywhere (and use `c10::str` where possible). One example is in `agent::send()`.

3. Add the ability to abort pending sends that cause blocking waits in `handleSend`. We need to abort these because, during a non-graceful shutdown, we could become blocked waiting on them: there is no guarantee the remote end is still active, which would result in a long wait and eventual timeout. We abort these by adding them to a map, and go through this map during `shutdown()`.

4. Fix flaky tests: `test_handle_send_exceptions`, `test_backward_node_failure`, and `test_backward_node_failure_python_udf`. These tests were flaky since they dealt with non-graceful shutdown of workers, which is prone to the edge cases explained above.

We have also refactored `createExceptionResponse`, `enqueueRecv`, and some test functions for the above reasons in this diff.

For testing:
Ensured that the tests are no longer flaky with 500 test runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.
ghstack-source-id: 100311598

Test Plan: Ensured that the tests are no longer flaky with 500 test runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.

Reviewed By: mrshenli

Differential Revision: D20269074

fbshipit-source-id: de9cad7f7185f9864ffbb6b14cd8ca9f6ff8f465
2020-03-17 19:01:41 -07:00
1c8e086537 [quant][graphmode][refactor] Change QParamMap to QParamVector (#34314)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34314

Test Plan:
.

Imported from OSS

Differential Revision: D20493032

fbshipit-source-id: fd945b861ae08e1d97f154aa2b1fb3099761882b
2020-03-17 18:35:15 -07:00
4bd3e9b41b fix barrier in jit test (#34901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34901

init_pg is needed for the dist.barrier call; otherwise the default process group may not be found for some RPC backends
ghstack-source-id: 100319642

Test Plan: unit  test

Differential Revision: D20495321

fbshipit-source-id: a44241bd2ff6e1404eee9b241270a94e9fd114d0
2020-03-17 18:19:08 -07:00
74a28ff1dd Make checkInputs more robust (#34838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34838

Differential Revision: D20500828

Pulled By: Krovatkin

fbshipit-source-id: 7eff720dff2698423f3e65b3809ff6f598f936d7
2020-03-17 17:51:12 -07:00
e43c2d59dd Reduce memory overhead of categorical.sample (#34900)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34714 (using the discussed solution). Thanks to jjabo for flagging and suggesting this.

Instead of expanding `probs` to prepend `sample_shape`, it is better to use the `num_samples` argument to `torch.multinomial`, which is faster and consumes less memory.

Existing tests should cover this. I have profiled this on different inputs and the change results in faster `.sample` (e.g. 100X faster on the example in the issue), or at worst is similar to what we have now with the default `sample_shape` argument.
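
Roughly, the usage pattern this speeds up (sizes illustrative):

```
# Drawing many samples now maps onto multinomial's num_samples argument
# instead of materializing an expanded copy of probs up front.
import torch
from torch.distributions import Categorical

dist = Categorical(probs=torch.rand(1000))
samples = dist.sample((10000,))  # no 10000 x 1000 expanded probs copy
print(samples.shape)             # torch.Size([10000])
```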

cc. fritzo, alicanb, ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34900

Differential Revision: D20499065

Pulled By: ngimel

fbshipit-source-id: e5be225e3e219bd268f5f635aaa9bf7eca39f09c
2020-03-17 17:49:41 -07:00
85c51a8c10 Fix dist autograd context Example block format (#34921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34921

Test Plan: Imported from OSS

Differential Revision: D20500012

Pulled By: mrshenli

fbshipit-source-id: 6c81123ad347726032c29630d7bf58feb6d8c5fd
2020-03-17 17:44:14 -07:00
f05abd1259 Fix example block format in Distributed Optimizer API doc (#34919)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34919

Test Plan: Imported from OSS

Differential Revision: D20500013

Pulled By: mrshenli

fbshipit-source-id: d28cbdd1ec207e1e8501ce389b7040fb764f12ca
2020-03-17 17:44:09 -07:00
e87db8a77b Fix example format in Distributed Autograd doc (#34914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34914

Test Plan: Imported from OSS

Differential Revision: D20500015

Pulled By: mrshenli

fbshipit-source-id: 55715fd1ffce143952d3f6ffcf60ee83ade0efb4
2020-03-17 17:44:01 -07:00
552f9d3a68 Minor fixes for RPC API docs (#34890)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34890

Test Plan: Imported from OSS

Differential Revision: D20491788

Pulled By: mrshenli

fbshipit-source-id: 95a9821d70e0afe51f586b891845b3106c7105ce
2020-03-17 17:43:55 -07:00
3c48aadd98 Update descriptions for transmitting CUDA tensors (#34888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34888

Test Plan: Imported from OSS

Differential Revision: D20491408

Pulled By: mrshenli

fbshipit-source-id: 4ca35ac9edd4c1af4f2bae2cfb0f1f6060658d5c
2020-03-17 17:43:48 -07:00
800bdcf000 Removing experimental tag in for RPC and adding experimental tag for RPC+TorchScript (#34887)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34887

Test Plan: Imported from OSS

Differential Revision: D20491409

Pulled By: mrshenli

fbshipit-source-id: ce79c9706eb70a3a52a4032de4f0bd538b694332
2020-03-17 17:43:42 -07:00
6446ccce76 Adding warnings for async Tensor serialization in remote and rpc_async (#34885)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34885

Test Plan: Imported from OSS

Differential Revision: D20491279

Pulled By: mrshenli

fbshipit-source-id: 8c861e7c7e9ea39f9427f80bc4e75c72c0087366
2020-03-17 17:43:35 -07:00
0d857d55b9 Add a warning for RRef serialization (#34884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34884

Test Plan: Imported from OSS

Differential Revision: D20491278

Pulled By: mrshenli

fbshipit-source-id: fd00701fd0090639ffe392f40610426c78bc9269
2020-03-17 17:40:55 -07:00
f87cd83d11 Append multiple arguments to list of flags as multiple items (#34899)
Summary:
This makes PyTorch compilable (but not linkable) with the `CUDA_SEPARABLE_COMPILATION` option enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34899

Test Plan: CI

Differential Revision: D20501050

Pulled By: malfet

fbshipit-source-id: 02903890a827fcc430a26f397d4d05999cf3a441
2020-03-17 16:48:32 -07:00
841f7600bb [quant][graphmode] Quantization pattern for aten::linear (#33854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33854

att

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20493031

fbshipit-source-id: bafd0a3ba5d07327d451b3915f043db33b012b53
2020-03-17 16:36:30 -07:00
71f02a481b [RPC] Avoid polluting Python root logger on importing "torch" module (#34871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34871

We used to configure the root logger in the RPC module, adding a stream handler to `root.handlers`. This is not desired behavior for PyTorch users. We should instead keep the root logger's handler list untouched.

We can configure a logger local to the RPC module and set its log level, so it doesn't fall back to its ancestor (usually the root logger, which has no stream handlers in most cases).
https://docs.python.org/3/library/logging.html#logging.Logger.setLevel

We also add a stream handler to make it output to stdout, even if the root logger is not configured and has an empty handler list.
https://docs.python.org/3/library/logging.html#logging.Logger.addHandler
https://docs.python.org/3/library/logging.handlers.html#logging.StreamHandler
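
A minimal sketch of the setup described above (the logger name is illustrative):

```
import logging
import sys

logger = logging.getLogger("torch.distributed.rpc")  # module-local logger
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))  # root.handlers untouched
```
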
ghstack-source-id: 100322141

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_wait_all_workers
```

Differential Revision: D7677493

fbshipit-source-id: 88a66079e7348c79a7933e3527701917cbebb7ba
2020-03-17 16:07:06 -07:00
58c5b6d306 adds quantized implementation of hard sigmoid (#34607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34607

Adds quantized version of hardsigmoid activation.

Note: not implementing the `_` and `.out` variants is intentional for now,
because the implementation changes the scale and zero point, and it is
nice to not let the user specify them. Let me know if we should handle
this differently.

Test Plan:
tests
benchmarks

Imported from OSS

Differential Revision: D20480546

fbshipit-source-id: 9febcb44afd920125ed2ca4900492f0b712078ea
2020-03-17 16:01:39 -07:00
97757dca79 Format register_distributed_ops.cpp (#34922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34922

format

Test Plan:

Differential Revision: D7717743

fbshipit-source-id: 207bd46a6b0579adbd35f6417af239ec717c7a41
2020-03-17 15:42:18 -07:00
0216c76e12 SFINAE Template Constructors of IValue (#34647) (#34843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34843

Currently, we use `not_ok_to_boxing` to filter out `Dimname`, which cannot
be converted/constructed to `IValue`. The correct way is to SFINAE the
constructors of `IValue`.

(Note: this ignores all push blocking failures!)

Test Plan:
PyTorch compiled after the code change.

All unit tests passed

Imported from OSS

Differential Revision: D20494886

fbshipit-source-id: 91dfba6a41a3ae2d6ceba9d4124cbf612ea3f080
2020-03-17 15:40:48 -07:00
959a7138fd Support RowWiseSparseAdam on GPU (#34341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34341

Implement RowWiseSparseAdam on CUDA

Reviewed By: xianjiec

Differential Revision: D20289209

fbshipit-source-id: a7a8a21bd18c1b9891f04f202d3ecaf183e30cad
2020-03-17 15:08:24 -07:00
72e3d66f50 [ROCm] Fix for std::isnan regression in ROCm (#34664)
Summary:
Filing this PR since we are in the process of migrating ROCm CI to ROCm 3.1. This patch ensures the correct functionality of float <-> bfloat16 conversion on ROCm 3.1, where `std::isnan` regresses.

iotamudelta ezyang

cc: ashishfarmer (original author of this patch)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34664

Differential Revision: D20440972

Pulled By: ezyang

fbshipit-source-id: 1ccb911c88f05566d94e01878df6c70cf7f31242
2020-03-17 15:03:17 -07:00
b227ea955e .circleci: Remove should_run_job, no longer needed (#34326)
Summary:
Done at the recommendation of ezyang

TODO:

- [x] Sync `XImportant`
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34326

Differential Revision: D20496786

Pulled By: seemethere

fbshipit-source-id: 8c84e097d81db28d7dcda8720973bce77f6eb4f7
2020-03-17 15:01:59 -07:00
5857a125df Turn on exact_dtype by default on test_optim.py (#34825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34825

Test Plan: Imported from OSS

Differential Revision: D20498111

Pulled By: great-way

fbshipit-source-id: e689ca40c496b6b4cccb0df30bdae89b2c024f31
2020-03-17 14:41:13 -07:00
a4224886f3 Eliminate guards through max_pool ops. (#34512)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34512

Differential Revision: D20478962

Pulled By: resistor

fbshipit-source-id: 86fc926305f95cae8b334ed344d8e0cdd1ef7b2b
2020-03-17 14:00:00 -07:00
6b701de130 Add types argument to __torch_function__ (#34303)
Summary:
This PR adds the `types` argument to `__torch_function__` as per RFC 0001: https://github.com/pytorch/rfcs/pull/3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34303
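
A toy sketch of the extended protocol (the class and the single override are made up; the exact calling convention evolved across releases, later becoming a classmethod):

```
import torch

class ScalarTensor:
    def __init__(self, value):
        self.value = value

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # `types` lists every participating type that defines the protocol,
        # letting an implementation defer when it sees one it cannot handle
        if not all(issubclass(t, (torch.Tensor, ScalarTensor)) for t in types):
            return NotImplemented
        if func is torch.mean:
            return ScalarTensor(float(args[0].value))
        return NotImplemented

print(torch.mean(ScalarTensor(3.0)).value)  # 3.0
```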

Differential Revision: D20474992

Pulled By: ezyang

fbshipit-source-id: cdd40b3b38f3bda4ece8812a629f5db87e919d01
2020-03-17 13:32:00 -07:00
275f5c8049 setup.py: Add numpy as required for install_requires (#34510)
Summary:
numpy was originally not a requirement, but we should add it back here
since it's required on import and we require it anyway for our conda
packages.

Tested with:

```
❯ pkginfo -f requires_dist *.whl
requires_dist: ['numpy']
```

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34510

Differential Revision: D20352125

Pulled By: seemethere

fbshipit-source-id: 383e396fe500ed7043d83c3df57d1772d0fff1e6
2020-03-17 13:31:55 -07:00
940e678da9 Add back cudaHostRegister to cudart API. (#34665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34665

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20493861

Pulled By: ezyang

fbshipit-source-id: 4215e3037a16be460f20cfc2859be5ee074128d3
2020-03-17 13:30:39 -07:00
7a3cf67fd8 Implement channels last upsample2d/3d forward pass kernel. (#34597)
Summary:
This PR implements channels-last nearest upsampling for 2D/3D.
This is supposed to be faster and also avoids converting formats going in
and out of the operator.
Will post benchmarking numbers.
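
A usage sketch of the case this kernel targets (shapes are arbitrary):

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 16, 16).to(memory_format=torch.channels_last)
y = F.interpolate(x, scale_factor=2, mode="nearest")
# with this kernel the output is expected to stay channels-last
print(y.is_contiguous(memory_format=torch.channels_last))
```
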
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34597

Test Plan: python test/test_nn.py TestNN.test_upsamplingNearest3d_channels_last

Differential Revision: D20390583

Pulled By: kimishpatel

fbshipit-source-id: e0162fb97604a261887f38fc957d3f787c80954e
2020-03-17 13:04:42 -07:00
3ad7dfa2cf move emulation libraries to contrib (#34861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34861

start with unary ops

Test Plan:
buck test //glow/fb/test/numerics/...

```
[hyz@devgpu019.snc1 ~/fbsource/fbcode/caffe2/caffe2/contrib] buck test //glow/fb/test/numerics/...
Action graph will be rebuilt because files have been added or removed.
Parsing buck files: finished in 2.0 sec
Building: finished in 9.8 sec (100%) 14826/14826 jobs, 23 updated
  Total time: 11.9 sec
Trace available for this run at /tmp/testpilot.20200316-143829.59858.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision 7228e74a7f7e8e4934ab79a135930e665ca0e589 fbpkg e6db8251dbeb46b68a52a862744deff4 at Sun Mar  8 21:16:39 2020 by twsvcscm from /data/fbprojects/packages/testinfra.testpilot/795/t.par
/proc/self/fd/4/__monkeytype_main_wrapper__.py:934: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
Discovering tests
Running 34 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874425505432
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_slsw_all_one_tenth_mel_25 (glow.fb.test.numerics.test_operator_onnxifi.SLSTest) 0.000 1/34 (passed)
      ✓ glow/fb/test/numerics:test_batchnorm_nnpi_fp16nnpi - test_bn (glow.fb.test.numerics.test_batchnorm_nnpi_fp16.BatchnormTest) 1.974 2/34 (passed)
      ✓ glow/fb/test/numerics:test_fc_nnpi_fp16nnpi - test_clip (glow.fb.test.numerics.test_fc_nnpi_fp16.FCTest) 1.371 3/34 (passed)
      ✓ glow/fb/test/numerics:test_batchmatmul_nnpi_fp16nnpi - test_batch_matmul (glow.fb.test.numerics.test_batchmatmul_nnpi_fp16.TestBatchMatMul) 2.993 4/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_clip_graph (glow.fb.test.numerics.test_operator_onnxifi.CommonOpsTest) 0.536 5/34 (passed)
      ✓ glow/fb/test/numerics:test_numerics_nnpinnpi - test_accumulator_limits (glow.fb.test.numerics.test_numerics_nnpi.AccTest) 0.472 6/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_mat_mul_graph (glow.fb.test.numerics.test_operator_onnxifi.MatMulTest) 0.495 7/34 (passed)
      ✓ glow/fb/test/numerics:test_op_nnpi_fp16nnpi - test_tanh (glow.fb.test.numerics.test_op_nnpi_fp16.UnaryOpTest) 0.573 8/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_fc_graph (glow.fb.test.numerics.test_operator_onnxifi.FCTest) 0.793 9/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_concat_graph_sampe_shape (glow.fb.test.numerics.test_operator_onnxifi.ConcatTest) 0.441 10/34 (passed)
      ✓ glow/fb/test/numerics:test_sls_nnpi_fp16nnpi - test_small_sls (glow.fb.test.numerics.test_sls_nnpi_fp16.SparseLengthsSumTest) 0.463 11/34 (passed)
      ✓ glow/fb/test/numerics:test_op_nnpi_fp16nnpi - test_add_graph (glow.fb.test.numerics.test_op_nnpi_fp16.ArithmeticOpsTest) 0.772 12/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_fp16fc_graph (glow.fb.test.numerics.test_operator_onnxifi.Fp16FCTest) 0.481 13/34 (passed)
      ✓ glow/fb/test/numerics:test_fc_nnpi_fp16nnpi - test_fc_exercise (glow.fb.test.numerics.test_fc_nnpi_fp16.FCTest) 0.495 14/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_tanh_graph (glow.fb.test.numerics.test_operator_onnxifi.TanhTest) 0.538 15/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_add_graph (glow.fb.test.numerics.test_operator_onnxifi.ArithmeticOpsTest) 0.517 16/34 (passed)
      ✓ glow/fb/test/numerics:test_fc_nnpi_fp16nnpi - test_fc_numeric_cases (glow.fb.test.numerics.test_fc_nnpi_fp16.FCTest) 0.555 17/34 (passed)
      ✓ glow/fb/test/numerics:test_op_nnpi_fp16nnpi - test_sub_graph (glow.fb.test.numerics.test_op_nnpi_fp16.ArithmeticOpsTest) 0.692 18/34 (passed)
      ✓ glow/fb/test/numerics:test_op_nnpi_fp16nnpi - test_sigmoid (glow.fb.test.numerics.test_op_nnpi_fp16.UnaryOpTest) 1.038 19/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_sigmoid_graph (glow.fb.test.numerics.test_operator_onnxifi.SigmoidTest) 0.530 20/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_div_graph (glow.fb.test.numerics.test_operator_onnxifi.ArithmeticOpsTest) 0.590 21/34 (passed)
      ✓ glow/fb/test/numerics:test_sls_4bit_nnpi_fp16nnpi - test_slws_fused_4bit_rowwise_all_same (glow.fb.test.numerics.test_sls_4bit_nnpi_fp16.SparseLengthsSumTest) 0.607 22/34 (passed)
      ✓ glow/fb/test/numerics:test_op_nnpi_fp16nnpi - test_div_graph (glow.fb.test.numerics.test_op_nnpi_fp16.ArithmeticOpsTest) 0.583 23/34 (passed)
      ✓ glow/fb/test/numerics:test_op_nnpi_fp16nnpi - test_mul_graph (glow.fb.test.numerics.test_op_nnpi_fp16.ArithmeticOpsTest) 0.803 24/34 (passed)
      ✓ glow/fb/test/numerics:test_numerics_nnpinnpi - test_accumulator_simple (glow.fb.test.numerics.test_numerics_nnpi.AccTest) 0.484 25/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_slws_fused_8bit_rowwise_length1_graph (glow.fb.test.numerics.test_operator_onnxifi.SLSTest) 9.069 26/34 (passed)
      ✓ glow/fb/test/numerics:test_sls_nnpi_fp16nnpi - test_slws_fused_8bit_rowwise_intel2 (glow.fb.test.numerics.test_sls_nnpi_fp16.SparseLengthsSumTest) 1.741 27/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_mul_graph (glow.fb.test.numerics.test_operator_onnxifi.ArithmeticOpsTest) 0.902 28/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_sub_graph (glow.fb.test.numerics.test_operator_onnxifi.ArithmeticOpsTest) 0.678 29/34 (passed)
      ✓ glow/fb/test/numerics:test_sls_nnpi_fp16nnpi - test_slws_fused_8bit_rowwise_all_same (glow.fb.test.numerics.test_sls_nnpi_fp16.SparseLengthsSumTest) 0.726 30/34 (passed)
      ✓ glow/fb/test/numerics:test_fc_nnpi_fp16nnpi - test_fc_num0 (glow.fb.test.numerics.test_fc_nnpi_fp16.FCTest) 1.621 31/34 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_slws_fused_8bit_rowwise_graph (glow.fb.test.numerics.test_operator_onnxifi.SLSTest) 10.121 32/34 (passed)
     ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - test_gather_graph (glow.fb.test.numerics.test_operator_onnxifi.CommonOpsTest) 99.675 33/34 (passed)
      ✓ glow/fb/test/numerics:fp16_op_test - FP16Test.4BitFusedSLS_NNPI 0.156 34/34 (passed)
      ✂ glow/fb/test/numerics:fp16_op_test - FP16Test.4BitFusedSLS_Interpreter 0.000 (OMITTED)
Test output:
> This test was disabled.
> To run this test locally, add the command line flag --run-disabled to your test command (prefix with -- if using buck).
> To view why this is disabled or re-enable this test in the test console, visit https://our.intern.facebook.com/intern/testinfra/testdetail/281474992503783
      ✓ glow/fb/test/numerics:fp16_op_test - main 3.986 (passed)
      ✓ glow/fb/test/numerics:test_numerics_nnpinnpi - main 12.606 (passed)
      ✓ glow/fb/test/numerics:test_sls_nnpi_fp16nnpi - main 12.622 (passed)
      ✓ glow/fb/test/numerics:test_fc_nnpi_fp16nnpi - main 12.688 (passed)
      ✓ glow/fb/test/numerics:test_operator_onnxifinnpi - main 12.688 (passed)
      ✓ glow/fb/test/numerics:test_batchnorm_nnpi_fp16nnpi - main 12.744 (passed)
      ✓ glow/fb/test/numerics:test_batchmatmul_nnpi_fp16nnpi - main 12.763 (passed)
      ✓ glow/fb/test/numerics:test_op_nnpi_fp16nnpi - main 12.800 (passed)
      ✓ glow/fb/test/numerics:test_sls_4bit_nnpi_fp16nnpi - main 13.034 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874425505432
Summary (total time 134.18s):
  PASS: 43
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 1
    glow/fb/test/numerics:fp16_op_test - FP16Test.4BitFusedSLS_Interpreter

```

Reviewed By: yinghai

Differential Revision: D20471053

fbshipit-source-id: 0bd8e69fbb843a02dc031f45a060aa78c602b42c
2020-03-17 12:50:41 -07:00
cfab65d90d Fix CMake Dev warning in caffe2/CMakeLists.txt (#34886)
Summary:
If the arguments of an `ENDIF()` are non-empty, they should match those of the corresponding `IF()` block.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34886

Test Plan: CI

Differential Revision: D20494631

Pulled By: malfet

fbshipit-source-id: 5fed86239b4a0cb4b3aedd02c950c1b800199d2d
2020-03-17 12:19:42 -07:00
3e68d0c5d0 Revert D20461609: [caffe2] open source 2/4-bit SLS operators
Test Plan: revert-hammer

Differential Revision:
D20461609

Original commit changeset: b3ef73ff10f2

fbshipit-source-id: e90ee5e34b1feab5b0bd582ed7e96e37de7044b0
2020-03-17 11:10:10 -07:00
95833a49e6 [TensorExpr] Pull changes from bertmaher/pytorch_fusion. (#34842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34842

This PR (hopefully the last one of this kind) merges changes from a
side branch where the tensor-expressions-based fuser work has been done so
far. This PR is a squashed version of changes in the side branch,
which is available here: https://github.com/bertmaher/pytorch

Differential Revision: D20478208

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 21556e009f1fd88099944732edba72ac40e9b9c0
2020-03-17 11:02:48 -07:00
ecd7c0f84c [RPC] Use qualified name str directly in RPC torch script code path (#34733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34733

simplify
ghstack-source-id: 100292435

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D20442573

fbshipit-source-id: 87f8b7d94adc03544f8e2955d01cd4702bb31a34
2020-03-17 10:28:52 -07:00
a0b7a39a92 Updating submodules
Summary:
GitHub commits:

eff7e6d11d
7812ac2fa9

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: a3f94dd5b48240169296d773b2828cd97b0871dd
2020-03-17 10:02:37 -07:00
65889388d1 Use randomtemp to resolve intermittent cuda build errors (#34777)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25393.
Core logic of randomtemp: https://github.com/peterjc123/randomtemp/blob/master/randomtemp/randomtemp.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34777

Differential Revision: D20491243

Pulled By: ezyang

fbshipit-source-id: 76b0e1819ac1e3f760d5451197bd75ea13df1f0b
2020-03-17 09:56:01 -07:00
67cb018462 Print cuda install logs for Windows CI (#34858)
Summary:
Related to https://github.com/pytorch/pytorch/issues/34821.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34858

Differential Revision: D20491248

Pulled By: ezyang

fbshipit-source-id: c6ddd59197a7bce31c1a3ea5dc28b0ee95d5c216
2020-03-17 09:37:25 -07:00
acbca57d18 improve batch_norm contiguous case's performance (#34530)
Summary:
For the batch_norm inference contiguous case, we can get better performance by manually vectorizing it.
Test script:
```
 import torch
 import torch.nn as nn
 import time

 torch.manual_seed(0)

 for n in [1, 10, 100]:
     for c in [1, 10, 100]:
         for hw in [1, 10, 200]:
             m = nn.BatchNorm2d(c, affine=False)
             m.eval()
             input = torch.randn(n, c, hw, hw)  # use n; the printout reports (n, c, hw, hw)
             # warm up
             for i in range(200):
                 output = m(input)
             fwd_t = 0
             for j in range(1000):
                 t1 = time.time()
                 output = m(input)
                 t2 = time.time()
                 fwd_t = fwd_t + (t2 -t1)

             fwd_avg = fwd_t / 1000 * 1000
             print("size = (%d, %d, %d, %d); compute time is %.4f(ms)" % (n, c, hw, hw, fwd_avg))
```

Before:
```
size = (1, 1, 1, 1); compute time is 0.0110(ms)
size = (1, 1, 10, 10); compute time is 0.0123(ms)
size = (1, 1, 200, 200); compute time is 0.8166(ms)
size = (1, 10, 1, 1); compute time is 0.0107(ms)
size = (1, 10, 10, 10); compute time is 0.0257(ms)
size = (1, 10, 200, 200); compute time is 8.7533(ms)
size = (1, 100, 1, 1); compute time is 0.0122(ms)
size = (1, 100, 10, 10); compute time is 0.1619(ms)
size = (1, 100, 200, 200); compute time is 123.5674(ms)
size = (10, 1, 1, 1); compute time is 0.0109(ms)
size = (10, 1, 10, 10); compute time is 0.0123(ms)
size = (10, 1, 200, 200); compute time is 0.5629(ms)
size = (10, 10, 1, 1); compute time is 0.0107(ms)
size = (10, 10, 10, 10); compute time is 0.0253(ms)
size = (10, 10, 200, 200); compute time is 8.7817(ms)
size = (10, 100, 1, 1); compute time is 0.0120(ms)
size = (10, 100, 10, 10); compute time is 0.1655(ms)
size = (10, 100, 200, 200); compute time is 123.2488(ms)
size = (100, 1, 1, 1); compute time is 0.0109(ms)
size = (100, 1, 10, 10); compute time is 0.0123(ms)
size = (100, 1, 200, 200); compute time is 0.5740(ms)
size = (100, 10, 1, 1); compute time is 0.0108(ms)
size = (100, 10, 10, 10); compute time is 0.0257(ms)
size = (100, 10, 200, 200); compute time is 8.7201(ms)
size = (100, 100, 1, 1); compute time is 0.0122(ms)
size = (100, 100, 10, 10); compute time is 0.1628(ms)
size = (100, 100, 200, 200); compute time is 123.1739(ms)
```
After:
```
size = (1, 1, 1, 1); compute time is 0.0105(ms)
size = (1, 1, 10, 10); compute time is 0.0114(ms)
size = (1, 1, 200, 200); compute time is 0.5771(ms)
size = (1, 10, 1, 1); compute time is 0.0105(ms)
size = (1, 10, 10, 10); compute time is 0.0160(ms)
size = (1, 10, 200, 200); compute time is 6.9851(ms)
size = (1, 100, 1, 1); compute time is 0.0122(ms)
size = (1, 100, 10, 10); compute time is 0.0848(ms)
size = (1, 100, 200, 200); compute time is 98.6758(ms)
size = (10, 1, 1, 1); compute time is 0.0105(ms)
size = (10, 1, 10, 10); compute time is 0.0115(ms)
size = (10, 1, 200, 200); compute time is 0.2690(ms)
size = (10, 10, 1, 1); compute time is 0.0105(ms)
size = (10, 10, 10, 10); compute time is 0.0159(ms)
size = (10, 10, 200, 200); compute time is 6.6946(ms)
size = (10, 100, 1, 1); compute time is 0.0123(ms)
size = (10, 100, 10, 10); compute time is 0.0854(ms)
size = (10, 100, 200, 200); compute time is 98.7327(ms)
size = (100, 1, 1, 1); compute time is 0.0107(ms)
size = (100, 1, 10, 10); compute time is 0.0116(ms)
size = (100, 1, 200, 200); compute time is 0.2681(ms)
size = (100, 10, 1, 1); compute time is 0.0104(ms)
size = (100, 10, 10, 10); compute time is 0.0159(ms)
size = (100, 10, 200, 200); compute time is 6.7507(ms)
size = (100, 100, 1, 1); compute time is 0.0124(ms)
size = (100, 100, 10, 10); compute time is 0.0852(ms)
size = (100, 100, 200, 200); compute time is 98.6866(ms)
```
For a real model, ResNeXt-101, we can also get a **~20%** performance improvement for large batch sizes.
Test script:
```
 import torch
 import torchvision
 import time

 torch.manual_seed(0)
 #torch.set_num_threads(1)

 model = torchvision.models.resnext101_32x8d().eval()

 for batch_size in [1, 64]:
     input = torch.randn(batch_size, 3, 224, 224)
     #warm up
     with torch.no_grad():
         for i in range(5):
             output = model(input)

         fwd_t = 0
         for i in range(10):
             t1 = time.time()
             output = model(input)
             t2 = time.time()
             fwd_t = fwd_t + (t2 - t1)

         time_fwd_avg = fwd_t / 10 * 1000
         print("Throughput of resnext101 with batch_size = %d is %10.2f (imgs/s)" % (batch_size, batch_size * 1000/              time_fwd_avg ))
```
Before:
```
Throughput of resnext101 with batch_size = 1 is       7.89 (imgs/s)
Throughput of resnext101 with batch_size = 64 is      13.02 (imgs/s)

num_threads =1
Throughput of resnext101 with batch_size = 1 is       2.97 (imgs/s)
Throughput of resnext101 with batch_size = 64 is       2.75 (imgs/s)
```
After:
```
Throughput of resnext101 with batch_size = 1 is       8.95 (imgs/s)
Throughput of resnext101 with batch_size = 64 is      15.52 (imgs/s)

num_threads = 1
Throughput of resnext101 with batch_size = 1 is       3.10 (imgs/s)
Throughput of resnext101 with batch_size = 64 is       2.88 (imgs/s)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34530

Differential Revision: D20479560

Pulled By: ngimel

fbshipit-source-id: 2e788ebcd814556116c90553ec61159eeffb3c16
2020-03-17 09:22:35 -07:00
a8ca340ad6 Remove all uses of AT_CHECK and replace them with TORCH_CHECK (#34846)
Summary:
AT_CHECK has been deprecated and provides no more features than
TORCH_CHECK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34846

Differential Revision: D20481339

Pulled By: mrshenli

fbshipit-source-id: 1777e769a069a78e03118270294e5e273d516ca7
2020-03-17 08:59:02 -07:00
76d9e76b4a Default to erroring when failing to return from non-void function. (#34663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34663

Been bitten by this so many times.  Never more.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20425480

Pulled By: ezyang

fbshipit-source-id: c4489efacc4149c9b57d1b8207cc872970c2501f
2020-03-17 07:31:56 -07:00
d9b97a4ffd [caffe2] open source 2/4-bit SLS operators (#34783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34783

Moving the 2/4-bit SLS and row-wise 2/4-bit conversion operators to open source, to be used by DLRM

Test Plan: CI

Reviewed By: yinghai

Differential Revision: D20461609

fbshipit-source-id: b3ef73ff10f2433afe06ffa73fe1145282d9ec4c
2020-03-17 01:00:31 -07:00
089a0a2117 [torchbind] Test moving custom classes to/from IValue (#34847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34847

Test Plan: Imported from OSS

Differential Revision: D20480512

Pulled By: jamesr66a

fbshipit-source-id: 87f5f8ea8764e26d383b17e4f72538166ddd0655
2020-03-16 23:57:42 -07:00
699a4ed8f5 [testing][do not land] (#34605)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34605

Test Plan: Imported from OSS

Differential Revision: D20393219

Pulled By: jamesr66a

fbshipit-source-id: c74d886f5f01061294203a002b72b75a3c446f09
2020-03-16 23:56:00 -07:00
89cbc0edea fix tests that could have racy script module instantiation (#34792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34792

It is not thread safe to instantiate a script module from multiple threads.

For both test_remote_script_module and test_torchscript_functions_not_supported, it is possible that the client thread is instantiating MyScriptModule while the server thread is instantiating it as well in the same rank's process.

Removing the MyScriptModule instantiation in the client thread; it is not actually needed.
ghstack-source-id: 100266609

Test Plan: unit tests

Differential Revision: D20463234

fbshipit-source-id: 6ff70ad90fa50b0b44c78df2495b4bcaabb4487b
2020-03-16 23:14:07 -07:00
e70c28856f [Caffe2] Move more method implementations from tensor.h to tensor.cc (#34811)
Summary:
To speed up compilation time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34811

Test Plan: CI

Differential Revision: D20476992

Pulled By: malfet

fbshipit-source-id: 922cde93783fbfc04854851d7a05a635d5239792
2020-03-16 22:15:18 -07:00
471ddacd8b Add retry decorator and use it for Hub tests. (#34829)
Summary:
fix https://github.com/pytorch/pytorch/issues/34751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34829
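
A hedged sketch of what such a retry decorator looks like (the actual helper lives in the test utilities; its name and signature may differ):

```
import functools
import time

def retry(exc_type, tries=3, delay=1):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return fn(*args, **kwargs)
                except exc_type:
                    if attempt == tries - 1:
                        raise
                    time.sleep(delay)  # back off, then retry the flaky call
        return wrapper
    return decorator
```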

Differential Revision: D20476231

Pulled By: ailzhang

fbshipit-source-id: eb38ee655e28250352b15e8e37b3b39310a7c378
2020-03-16 20:19:45 -07:00
b336deb6ee [quant][mobile] Not use qnnpack max_pool2d if ceil_mode is true (#34844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34844

The QNNPACK max_pool2d operator does not support ceil_mode, which can cause kernel crashes when it is set to true.
We now default to the server implementation when ceil_mode is set to true.
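
A usage sketch of the affected configuration (values are illustrative):

```
import torch

x = torch.randn(1, 8, 7, 7)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
pool = torch.nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
qy = pool(qx)  # with ceil_mode=True this now avoids the QNNPACK kernel
```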

Test Plan:
python test/test_quantized.py

Imported from OSS

Differential Revision: D20478701

fbshipit-source-id: 7962444ac493f5c3c32a9aa1a7be465e8b84ccc2
2020-03-16 19:27:04 -07:00
1e140c353c [profiler][rpc] fix a race condition in the profiler when multiple threads call (#33719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33719

We were seeing a strange error where gathering profiler events (specifically `parse_cpu_trace` in `profiler.py`) would fail with the error:
`IndexError: pop from empty list`.

It turned out that this was because for one particular `Event`, there was a pop recorded but not a push. Instead of the `push` event being completely missing, it was overwritten by a completely different event.

After a bunch of debugging, and trying several hypotheses, it turns out that this was a race condition in `RangeEventList::record`. What happened was that different threads would call into `RangeEventList::record` on the same event list instance, and one record would stomp over the data written by the other one. Somehow the data written was a valid `Event` so the error did not manifest itself until the profiler realized a `pop` was missing a matching `push` in the python code.

I fixed this by adding a lock to serialize writes to `RangeEventList::record`.

This PR also makes a small change to pass in the `RecordFunction` name into `popRange`. It makes the debugging easier when investigating the events recorded.

Differential Revision: D20071125

fbshipit-source-id: 70b51a65bcb833a7c88b7462a978fd3a39265f7e
2020-03-16 18:41:16 -07:00
422e348619 Don't run user function until all UserRRefs in the args are confirmed (#34497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497

Use a thread_local table to intercept UserRRefs created during user
function args deserialization, and then wait for confirmations of
those UserRRefs before launching the given user function.

Differential Revision: D20347464

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07
2020-03-16 18:30:06 -07:00
d876fef743 Fix send count for local RPC (#34809)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34809

Test Plan: Imported from OSS

Differential Revision: D20470495

Pulled By: mrshenli

fbshipit-source-id: 2d6e2a2889be07fb074443f05db5089291daf8cf
2020-03-16 18:30:01 -07:00
38b2856c71 Split deserialize from runPythonUdf and remove generatePythonUDFResult (#34496)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34496

Differential Revision: D20347469

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: b832a3a9e2ef61f149175f737b26f65d63bf797b
2020-03-16 18:28:07 -07:00
ae0c88d6aa .circleci: Add manywheel builds for python 3.8 (#34732)
Summary:
Not entirely sure why this wasn't here before but we definitely need to
test for this.

Closes https://github.com/pytorch/pytorch/issues/34727

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34732

Differential Revision: D20480508

Pulled By: seemethere

fbshipit-source-id: 43bcff679ca35993f6bf1b10980acd7c86f780b1
2020-03-16 17:28:46 -07:00
480d1849b0 [ONNX] Fix for expand -1 dim value (#34069)
Summary:
PyTorch's expand allows -1 as a dim value, which means the dimension is inferred from the input tensor. This can be exported to ONNX Expand with a dim value of 1, since ONNX Expand supports two-way broadcasting.
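
For example (a sketch of the semantics being mapped; the export call itself is omitted):

```
import torch

x = torch.randn(1, 4)
y = x.expand(3, -1)  # -1 keeps the existing size, inferred as 4
print(y.shape)       # torch.Size([3, 4])
```
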
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34069

Reviewed By: hl475

Differential Revision: D20195532

Pulled By: houseroad

fbshipit-source-id: c90e7d51b9d7422c09c5ed6e135ca8263105b8c9
2020-03-16 15:30:20 -07:00
1bac5fd0d3 add hardsigmoid FP operator to PyTorch (#34545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34545

This is for common operator coverage, since this is widely used.  A future PR
will add the quantized version.

Some initial questions for reviewers, since it's my first FP operator
diff:
* do we need a backwards.out method for this?
* do we need CUDA? If yes, should it be this PR or is it ok to split
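
A reference sketch of the op being added (PyTorch defines hardsigmoid as relu6(x + 3) / 6; this standalone reimplementation is only for illustration):

```
import torch
import torch.nn.functional as F

def hardsigmoid_ref(x):
    # piecewise-linear approximation of sigmoid: 0 below -3, 1 above 3
    return F.relu6(x + 3.0) / 6.0

print(hardsigmoid_ref(torch.linspace(-5.0, 5.0, steps=11)))
```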

Test Plan:
```
// test
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardsigmoid_cpu_float32

// benchmark
python -m pt.hardsigmoid_test
...
Forward Execution Time (us) : 40.315

Forward Execution Time (us) : 42.603
```

Imported from OSS

Differential Revision: D20371692

fbshipit-source-id: 95668400da9577fd1002ce3f76b9777c6f96c327
2020-03-16 15:24:12 -07:00
6d8649dc53 [caffe2] fix Transpose2D calls in NHWC<->NCHW (#34625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34625

These templated function calls are not specifying the template args correctly.  The first arg is the index type, not the array data type.  That means, right now it's using `T` as the index type as well, which will break if we do a template specialization for uint8_t.  If we omit both, it will correctly infer that the index type is `int` and the data type is `T`.

Reviewed By: BIT-silence

Differential Revision: D20358728

fbshipit-source-id: 8cbd8eeb14bce602c02eb6fce2cc141f0121fa24
2020-03-16 15:18:44 -07:00
31eaeba38a Increase the prec of test_baddbmm (#34764)
Summary:
This test is flaky on my computer; the error is:
```
AssertionError: tensor(1.3351e-05) not less than or equal to 1e-05
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34764

Differential Revision: D20476006

Pulled By: ezyang

fbshipit-source-id: dad7e702275346070552c8a98765c37e6ca2c197
2020-03-16 15:06:01 -07:00
8bae1ed144 PCA and SVD for low-rank matrices, LOBPCG for positive-defined generalized eigenvalue problem - copy (#34721)
Summary:
This is a copy of PR https://github.com/pytorch/pytorch/issues/29488 to help the merging process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34721

Differential Revision: D20444270

Pulled By: vincentqb

fbshipit-source-id: 042c56c8c0dae37834f52b4aee2deae7dd6fa659
2020-03-16 14:13:30 -07:00
976d6aaa51 Revert D20251830: [TensorExpr] Add tensorexpr benchmarks.
Test Plan: revert-hammer

Differential Revision:
D20251830

Original commit changeset: bafd66ce32f6

fbshipit-source-id: d8aea4b26441d8aba90c11d7350d3424df494052
2020-03-16 13:20:16 -07:00
ef78fa8668 caffe2::OperatorBase do not need to be aware of at::Tensor functions (#34810)
Summary:
Replacing <ATen/core/Tensor.h> with <ATen/core/TensorBody.h> speeds up compilation of caffe2 operators by 15%.
For example, it reduces pool_op.cu compilation from 18.8s to 16s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34810

Test Plan: CI

Differential Revision: D20472230

Pulled By: malfet

fbshipit-source-id: e1b261cc24ff577f09e2d5f6428be2063c6d4a8b
2020-03-16 12:58:05 -07:00
e93e7b2795 [TensorExpr] Add tensorexpr benchmarks. (#34230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34230

This PR adds some benchmarks that we used to assess tensor expressions performance.

Differential Revision: D20251830

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: bafd66ce32f63077e3733112d854f5c750d5b1af
2020-03-16 11:49:39 -07:00
ea5c86c276 [TensorExpr] Add LLVM codegen. (#34228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34228

This PR adds LLVM codegen to tensor expressions. LLVM is added as an
optional build dependency specified with `USE_LLVM=<path_to_llvm>`
variable. If this variable is not set or LLVM is not found in the
specified path, the LLVM codegen is completely disabled.

Differential Revision: D20251832

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 77e203ab4421eb03afc64f8da17e0daab277ecc2
2020-03-16 11:49:34 -07:00
35e7efeb9a [TensorExpr] Add CUDA codegen. (#34227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34227

This PR adds a CUDA support to tensor expressions.

Differential Revision: D20251836

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: ab36a55834cceff30c8371fef6cca1054a32f017
2020-03-16 11:49:29 -07:00
42b2c8c65d [TensorExpr] Add a fuser pass based on tensor expressions. (#34226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34226

LLVM and CUDA backends are added in subsequent PRs, so at this point the fuser is pretty useless, but it can still be tested, and its logic is not going to change with the addition of the codegens.

Differential Revision: D20251838

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 82b0d221fa89904ed526689d02a6c7676a8ce8de
2020-03-16 11:49:24 -07:00
e31d462e92 [TensorExpr] Pull changes to core classes for representing expressions and statements from the side branch. (#34224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34224

Our development has been happening on a side branch `pytorch_fusion` in
`bertmaher/pytorch` fork. This PR moves changes to the core classes
representing expressions and transformations on them.

At this moment, the tensor expressions are only used in tests.
Subsequent PRs add LLVM and CUDA codegen for tensor expressions and
implement fuser on top of these.

This PR is huge as it is a squashed version of changes in the side
branch. It is not practical to pull changes one by one from the branch,
so here is the squashed version. If you're interested in seeing the
history of changes, please refer to https://github.com/bertmaher/pytorch

Differential Revision: D20251835

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 1a871acc09cf3c6f7fb4af40d408cdbb82dc7dab
2020-03-16 11:47:47 -07:00
99b91ee2ad [fix][tiny][caffe2] Avoid triggering errors when allow ratio is 100% (#34757)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34757

Reviewed By: Wakeupbuddy

Differential Revision: D20451255

fbshipit-source-id: 07997cf31dba653b61d082ec3f28357c3b90c4eb
2020-03-16 11:39:32 -07:00
24c9e61e79 Enable JIT tests on Windows (#27029)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27029

Reviewed By: eellison

Differential Revision: D20458664

Pulled By: jamesr66a

fbshipit-source-id: 22be918543703869f471e89b3478423198351bf3
2020-03-16 11:26:21 -07:00
1af6002321 Initial implementation of NNPI Int8FC op
Test Plan:
```
 buck test mode/no-gpu glow/fb/test/numerics:test_fc_nnpi_int8nnpi -- --print-passing-detail
```

Reviewed By: hyuen

Differential Revision: D20450490

fbshipit-source-id: c4811cdc994548b6e319d57115434dfc199e07c2
2020-03-16 10:46:17 -07:00
a57f92e4de [jit] copy unused/ignored methods to ScriptModule during compilation (#33981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33981

Okay it turns out that https://github.com/pytorch/pytorch/pull/29342
deletes actually useful things from the resulting Python module. In
particular, people like having `ignore`'d methods attached so that they
can invoke them from python.

Test Plan: Imported from OSS

Differential Revision: D20171650

Pulled By: suo

fbshipit-source-id: 71862e932c6a56cd055d0cff6657887ee0ceb9a8
2020-03-16 10:38:59 -07:00
cec9758afa [quant][graphmode] Add quantization pattern for quantized::add_relu (#33532)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33532

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20354880

fbshipit-source-id: ea608a5ace395a909851f9e577ffdcb51512a3af
2020-03-16 10:20:57 -07:00
8eaafbd99b Remove unused newWithSize declaration. (#34730)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34730

Test Plan: Imported from OSS

Differential Revision: D20446078

Pulled By: gchanan

fbshipit-source-id: 0effc088dcba4f60385e3b23fa656cb772a3b7bc
2020-03-16 09:17:54 -07:00
b94d650868 Remove unused newView declaration. (#34729)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34729

Test Plan: Imported from OSS

Differential Revision: D20446077

Pulled By: gchanan

fbshipit-source-id: b68471aeaf673851bdfc6bb0615aba8ebb883a4c
2020-03-16 09:16:14 -07:00
a66b837b19 Migrate dirichlet_grad from CUDA_tensor_apply4 to TensorIterator (#33996)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33996

Test Plan: Imported from OSS

Differential Revision: D20196789

Pulled By: VitalyFedyunin

fbshipit-source-id: 69ee720f4f3d8a2df91874b77ee3918ce1b951b2
2020-03-16 08:56:32 -07:00
c3c0cf1591 Migrate binary_cross_entropy_backward from CUDA_tensor_apply4 to (#33995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33995

TensorIterator

Test Plan: Imported from OSS

Differential Revision: D20196790

Pulled By: VitalyFedyunin

fbshipit-source-id: c0c231a20e6e69fc3c68c3ac5082b20f2feb6158
2020-03-16 08:54:49 -07:00
762be86e63 [C++ API Parity] [Optimizers] added closure to optimizers (#34790)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34790

Differential Revision: D20468361

Pulled By: anjali411

fbshipit-source-id: 1c6115d735b211dc2bedf002d58931cb32cf657a
2020-03-16 07:51:44 -07:00
bdd7dbfd4b [C++ API] RNN / GRU / LSTM layer refactoring (#34322)
Summary:
This PR refactors RNN / GRU / LSTM layers in C++ API to exactly match the implementation in Python API.

**BC-breaking changes:**
- Instead of returning `RNNOutput`, RNN / GRU forward method now returns `std::tuple<Tensor, Tensor>`, and LSTM forward method now returns `std::tuple<Tensor, std::tuple<Tensor, Tensor>>`, matching Python API.
- RNN / LSTM / GRU forward method now accepts the same inputs (input tensor and optionally hidden state), matching Python API.
- RNN / LSTM / GRU layers now have `forward_with_packed_input` method which accepts `PackedSequence` as input and optionally hidden state, matching the `forward(PackedSequence, ...)` variant in Python API.
- RNN / LSTM / GRU layers no longer have these fields: `w_ih` / `w_hh` / `b_ih` / `b_hh`. Instead, to access the weights and biases of the gates, users should do e.g. `rnn->named_parameters()["weight_ih_l0"]`, which mirrors the Python API `rnn.weight_ih_l0`.
- In `RNNOptions`
    - `tanh()` / `relu()` / `activation` are removed. Instead, `nonlinearity` is added which takes either `torch::kTanh` or `torch::kReLU`
    - `layers` -> `num_layers`
    - `with_bias` -> `bias`
- In `LSTMOptions`
    - `layers` -> `num_layers`
    - `with_bias` -> `bias`
- In `GRUOptions`
    - `layers` -> `num_layers`
    - `with_bias` -> `bias`

The majority of the changes in this PR focused on refactoring the implementations in `torch/csrc/api/src/nn/modules/rnn.cpp` to match the Python API. RNN tests are then changed to reflect the revised API design.
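
The refactored C++ forward signatures now mirror this Python behavior, sketched here with arbitrary sizes:

```
import torch

lstm = torch.nn.LSTM(input_size=4, hidden_size=8, num_layers=2)
out, (h, c) = lstm(torch.randn(5, 3, 4))  # tuple return, hidden state optional
w = lstm.weight_ih_l0                     # per-gate weights accessed by name
```
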
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34322

Differential Revision: D20458302

Pulled By: yf225

fbshipit-source-id: ffff2ae1ddb1c742c966956f6ad4d7fba03dc54d
2020-03-15 17:48:29 -07:00
d4f182d06b Add overloaded name to prim operators (#34280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34280

To make prim ops searchable for the lite interpreter, overloaded names need to be added for operators with the same name but different schemas. For example, aten::add in register_prim_ops.cpp. The distinguishing difference is a combination of argument and output types.
`"aten::add(str a, str b) ->str"`
`"aten::add(int a, int b) ->int"`
`"aten::add(float a, float b) ->float"`
`"aten::add(int a, float b) ->float"`
`"aten::add(float a, int b) ->float"`
`"aten::add(Scalar a, Scalar b) ->Scalar"`

Solution:
Use the argument type and/or output type (the same to the existing overloaded names). The overloaded name should be minimum as long as the operators can be differentiated. For other operators please look into the source code change for details.

`"aten::add.str(str a, str b) ->str"`
`"aten::add.int(int a, int b) ->int"`
`"aten::add.float(float a, float b) ->float"`
`"aten::add.int_float(int a, float b) ->float"`
`"aten::add.float_int(float a, int b) ->float"`
`"aten::add.Scalar_Scalar(Scalar a, Scalar b) ->Scalar"`

Test Plan: Imported from OSS

Differential Revision: D20456997

Pulled By: iseeyuan

fbshipit-source-id: 2c3dc324b4a4e045559f62c6cc2a10fbb9a72dcf
2020-03-15 17:05:54 -07:00
c86d1361b8 Removes unused THCTensor_(triu), THCTensor_(div) (#34712)
Summary:
Per title. Dead code removal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34712

Differential Revision: D20442618

Pulled By: mruberry

fbshipit-source-id: b03aa4984328f94021c1480e21375fd868d6d550
2020-03-15 16:42:35 -07:00
c258e4732a solve conv3d backward get incorrect result problem (#34358)
Summary:
Fix https://github.com/pytorch/pytorch/issues/34344.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34358

Differential Revision: D20461698

Pulled By: ngimel

fbshipit-source-id: 472624d0037ab65d9dcc221f647ec68818be5fc9
2020-03-15 16:15:53 -07:00
7848c229b8 Move min and max(reduce all) to Aten(CPU) (#33936)
Summary:
This PR ports min and max (reduce all) to ATen.
Performance test script:
```
import torch
import timeit

torch.manual_seed(0)
#torch.set_num_threads(1)

device = "cpu"
print(f'device: {device}')
for op in ('max', 'min'):
    for dtype in ('torch.double', 'torch.float', 'torch.int16', 'torch.int32', 'torch.int64'):
        for n, t in [(20_000, 200000),
                     (200_000, 20000)]:
            print(f'a.{op}(), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(f'a.{op}()', setup=f'import torch; a =(torch.torch.randn({n}) * 100).to({dtype})', number=t))
```
Test device: **skx-8180, 2 sockets**
Before:
```
a.max(), numel() == 20000 for 200000 times, dtype=torch.double
2.773961597122252
a.max(), numel() == 200000 for 20000 times, dtype=torch.double
2.3256353894248605
a.max(), numel() == 20000 for 200000 times, dtype=torch.float
3.800648272037506
a.max(), numel() == 200000 for 20000 times, dtype=torch.float
3.31692426931113
a.max(), numel() == 20000 for 200000 times, dtype=torch.int16
2.735901520587504
a.max(), numel() == 200000 for 20000 times, dtype=torch.int16
2.2510280115529895
a.max(), numel() == 20000 for 200000 times, dtype=torch.int32
2.723656536079943
a.max(), numel() == 200000 for 20000 times, dtype=torch.int32
2.228839812800288
a.max(), numel() == 20000 for 200000 times, dtype=torch.int64
2.703160767443478
a.max(), numel() == 200000 for 20000 times, dtype=torch.int64
2.3175809988752007
a.min(), numel() == 20000 for 200000 times, dtype=torch.double
2.820106916129589
a.min(), numel() == 200000 for 20000 times, dtype=torch.double
2.325718787498772
a.min(), numel() == 20000 for 200000 times, dtype=torch.float
3.833602518774569
a.min(), numel() == 200000 for 20000 times, dtype=torch.float
3.316444822587073
a.min(), numel() == 20000 for 200000 times, dtype=torch.int16
2.7308286419138312
a.min(), numel() == 200000 for 20000 times, dtype=torch.int16
2.198460517451167
a.min(), numel() == 20000 for 200000 times, dtype=torch.int32
2.730219766497612
a.min(), numel() == 200000 for 20000 times, dtype=torch.int32
2.2268200274556875
a.min(), numel() == 20000 for 200000 times, dtype=torch.int64
2.7342184390872717
a.min(), numel() == 200000 for 20000 times, dtype=torch.int64
2.320415544323623
```
After:
```
a.max(), numel() == 20000 for 200000 times, dtype=torch.double
1.7767417253926396
a.max(), numel() == 200000 for 20000 times, dtype=torch.double
0.550495645031333
a.max(), numel() == 20000 for 200000 times, dtype=torch.float
1.1113408291712403
a.max(), numel() == 200000 for 20000 times, dtype=torch.float
0.44446005020290613
a.max(), numel() == 20000 for 200000 times, dtype=torch.int16
0.5246349424123764
a.max(), numel() == 200000 for 20000 times, dtype=torch.int16
0.47057845536619425
a.max(), numel() == 20000 for 200000 times, dtype=torch.int32
0.6597231412306428
a.max(), numel() == 200000 for 20000 times, dtype=torch.int32
0.40366593934595585
a.max(), numel() == 20000 for 200000 times, dtype=torch.int64
1.767227927222848
a.max(), numel() == 200000 for 20000 times, dtype=torch.int64
0.6187495030462742
a.min(), numel() == 20000 for 200000 times, dtype=torch.double
1.7881382443010807
a.min(), numel() == 200000 for 20000 times, dtype=torch.double
0.5440589748322964
a.min(), numel() == 20000 for 200000 times, dtype=torch.float
1.1090848250314593
a.min(), numel() == 200000 for 20000 times, dtype=torch.float
0.4293213738128543
a.min(), numel() == 20000 for 200000 times, dtype=torch.int16
0.5207074657082558
a.min(), numel() == 200000 for 20000 times, dtype=torch.int16
0.41422136034816504
a.min(), numel() == 20000 for 200000 times, dtype=torch.int32
0.6145811947062612
a.min(), numel() == 200000 for 20000 times, dtype=torch.int32
0.4172037309035659
a.min(), numel() == 20000 for 200000 times, dtype=torch.int64
1.7397673893719912
a.min(), numel() == 200000 for 20000 times, dtype=torch.int64
0.596766366623342
```
Single thread:
Before:
```
a.max(), numel() == 20000 for 200000 times, dtype=torch.double
2.5068740313872695
a.max(), numel() == 200000 for 20000 times, dtype=torch.double
2.234461876563728
a.max(), numel() == 20000 for 200000 times, dtype=torch.float
3.5549037409946322
a.max(), numel() == 200000 for 20000 times, dtype=torch.float
3.2497852174565196
a.max(), numel() == 20000 for 200000 times, dtype=torch.int16
2.493077039718628
a.max(), numel() == 200000 for 20000 times, dtype=torch.int16
2.171935741789639
a.max(), numel() == 20000 for 200000 times, dtype=torch.int32
2.469274105504155
a.max(), numel() == 200000 for 20000 times, dtype=torch.int32
2.273881389759481
a.max(), numel() == 20000 for 200000 times, dtype=torch.int64
2.5818942049518228
a.max(), numel() == 200000 for 20000 times, dtype=torch.int64
2.2394551979377866
a.min(), numel() == 20000 for 200000 times, dtype=torch.double
2.5894540259614587
a.min(), numel() == 200000 for 20000 times, dtype=torch.double
2.331936141476035
a.min(), numel() == 20000 for 200000 times, dtype=torch.float
3.590122046880424
a.min(), numel() == 200000 for 20000 times, dtype=torch.float
3.255849950015545
a.min(), numel() == 20000 for 200000 times, dtype=torch.int16
2.5205496419221163
a.min(), numel() == 200000 for 20000 times, dtype=torch.int16
2.168218174017966
a.min(), numel() == 20000 for 200000 times, dtype=torch.int32
2.658622432500124
a.min(), numel() == 200000 for 20000 times, dtype=torch.int32
2.3376982398331165
a.min(), numel() == 20000 for 200000 times, dtype=torch.int64
2.496626536361873
a.min(), numel() == 200000 for 20000 times, dtype=torch.int64
2.2504652086645365
```
After:
```
a.max(), numel() == 20000 for 200000 times, dtype=torch.double
1.9525171788409352
a.max(), numel() == 200000 for 20000 times, dtype=torch.double
1.6108122132718563
a.max(), numel() == 20000 for 200000 times, dtype=torch.float
1.2444602297618985
a.max(), numel() == 200000 for 20000 times, dtype=torch.float
0.7705567870289087
a.max(), numel() == 20000 for 200000 times, dtype=torch.int16
0.6575072864070535
a.max(), numel() == 200000 for 20000 times, dtype=torch.int16
0.13242999743670225
a.max(), numel() == 20000 for 200000 times, dtype=torch.int32
0.829406064003706
a.max(), numel() == 200000 for 20000 times, dtype=torch.int32
0.35575105529278517
a.max(), numel() == 20000 for 200000 times, dtype=torch.int64
1.6426756298169494
a.max(), numel() == 200000 for 20000 times, dtype=torch.int64
1.4049720335751772
a.min(), numel() == 20000 for 200000 times, dtype=torch.double
2.029639278538525
a.min(), numel() == 200000 for 20000 times, dtype=torch.double
1.6363644907251
a.min(), numel() == 20000 for 200000 times, dtype=torch.float
1.3821239182725549
a.min(), numel() == 200000 for 20000 times, dtype=torch.float
0.834847847931087
a.min(), numel() == 20000 for 200000 times, dtype=torch.int16
0.6913397628813982
a.min(), numel() == 200000 for 20000 times, dtype=torch.int16
0.1370067736133933
a.min(), numel() == 20000 for 200000 times, dtype=torch.int32
0.8190992185845971
a.min(), numel() == 200000 for 20000 times, dtype=torch.int32
0.3640836915001273
a.min(), numel() == 20000 for 200000 times, dtype=torch.int64
1.6516661625355482
a.min(), numel() == 200000 for 20000 times, dtype=torch.int64
1.4111155439168215
```
Fixes: https://github.com/pytorch/pytorch/issues/33197

Fix https://github.com/pytorch/pytorch/issues/24728, https://github.com/pytorch/pytorch/issues/24729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33936

Differential Revision: D20461658

Pulled By: ngimel

fbshipit-source-id: 5749260114ace3ea7b513e32edc805c844a19c8a
2020-03-15 16:09:58 -07:00
f058c03b15 Disallow sending CUDA tensors over RPC for current RPC agents. (#33604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33604

For our current RPC agents, this PR disallows sending CUDA tensors
over RPC and asks users to copy them explicitly to CPU. Currently, this seems
to be the easiest contract to guarantee for our current RPC agents; if we
instead supported this transparently, it would get tricky to decide whether a
CUDA tensor from the client should land on the CPU or a GPU of the remote end,
and on which GPU device.

In the future, the TensorPipe RPC agent can have its own specific handling of
CUDA tensors.
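
The resulting usage contract, as a sketch (the worker name is a placeholder and a CUDA device is assumed):

```
import torch
import torch.distributed.rpc as rpc

t = torch.ones(4, device="cuda")
# CUDA tensors must be moved to CPU explicitly before being sent over RPC
fut = rpc.rpc_async("worker1", torch.add, args=(t.cpu(), torch.ones(4)))
```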

Closes https://github.com/pytorch/pytorch/issues/28881
ghstack-source-id: 100166120

Test Plan: waitforbuildbot

Differential Revision: D20020183

fbshipit-source-id: ca4d43d2a24e8fcd3a60b21e654aa0e953e756cb
2020-03-15 15:01:46 -07:00
f404537c26 CUDA Loops: move address computation into policy, make policy.load load all arguments (#33720)
Summary:
This is done so that, in the future, we can make the policy accept an offset calculator in its constructor to support non-contiguous tensors.

The `elementwise_kernel_helper` is now very general and can handle all cases:

```C++
template<typename func_t, typename policy_t>
__device__ inline void elementwise_kernel_helper(func_t f, policy_t policy) {
  using traits = function_traits<func_t>;
  using return_t = typename traits::result_type;
  using args_t = typename traits::ArgsTuple;

  int idx = blockIdx.x;

  return_t results[thread_work_size];
  cuda9::workaround::enable_default_constructor<args_t> args_[thread_work_size];
  args_t *args = reinterpret_cast<args_t *>(&args_);

  // load
  policy.load(args, idx);

  // compute
  #pragma unroll
  for (int i = 0; i < thread_work_size; i++) {
    if (policy.check_inbounds(i)) {
      results[i] = c10::guts::apply(f, args[i]);
    }
  }

  // store
  policy.store(results, idx);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33720

Differential Revision: D20459652

Pulled By: ngimel

fbshipit-source-id: aa8b122e0e8c6e08ab354785e04753ff778882e2
2020-03-15 14:41:05 -07:00
528aabd373 Fix backward compatibility check test for schemas containing (#34782)
Summary:
"torch.classes".
BC check tests skips adding torch.classes based schemas to existing
schemas. Removed the skip.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34782

Test Plan:
cd test/backward_compatibility
python dump_all_function_schemas.py --filename new_schemas.txt
python check_backward_compatibility.py --new-schemas new_schemas.txt

Before this PR, the check fails with:
```
Mar 15 11:12:20 Broken ops: [
Mar 15 11:12:20 	_xnnpack::conv2d_packed(Tensor X, __torch__.torch.classes.XNNPackConv2dOpContext W_prepack) -> (Tensor Y)
Mar 15 11:12:20 	_xnnpack::conv2d_prepack(Tensor W, Tensor? B, int[2] stride, int[2] padding, int[2] dilation, int groups) -> (__torch__.torch.classes.XNNPackConv2dOpContext)
Mar 15 11:12:20 	_xnnpack::linear_packed(Tensor X, __torch__.torch.classes.XNNPackLinearOpContext W_prepack) -> (Tensor Y)
Mar 15 11:12:20 	_xnnpack::linear_prepack(Tensor W, Tensor? B=None) -> (__torch__.torch.classes.XNNPackLinearOpContext)
Mar 15 11:12:20 ]
```
After this PR, it passes.

Reviewed By: houseroad

Differential Revision: D20461994

Pulled By: kimishpatel

fbshipit-source-id: de692644ee7d49accf2d8260cd3a10f6e147653a
2020-03-15 14:35:19 -07:00
15c84c37b6 [PyTorch BC] Clean up the BC whitelist (#34784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34784

Remove stale items

Test Plan: ci

Reviewed By: hl475

Differential Revision: D20461740

fbshipit-source-id: 46dcc39f3a867165aadee182033b09ca65ee8551
2020-03-15 12:46:57 -07:00
08bc3c6cbf Remove unnecessary import (#34778)
Summary:
https://github.com/pytorch/pytorch/issues/34563 accidentally introduced a lint error due to an unused import. This PR removes this import.

Jit tests run as expected after this change:
```
> python test/test_jit.py
.....
Ran 2435 tests in 100.077s

OK (skipped=140, expected failures=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34778

Differential Revision: D20459708

Pulled By: tugrulince

fbshipit-source-id: bb742085fafc849ff3d9507d1557556e01fbeb4b
2020-03-15 09:56:55 -07:00
1d81bd02cc Export roi_align_gradient_op to c10 (#34776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34776

Export roi_align_gradient_op to c10

Test Plan: unittest

Reviewed By: houseroad

Differential Revision: D20459210

fbshipit-source-id: 80bf065f83bb44b39a150bae25b3591c16f522fa
2020-03-15 02:43:39 -07:00
373c80ee90 Fix missing header (#34762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34762

So far it has only been by luck that we happen to include "caffe2/core/tensor.h" before including "caffe2/caffe2/quantization/server/fbgemm_pack_blob.h". This is not safe, and this diff fixes it.

Test Plan: unittest

Reviewed By: jianyuh

Differential Revision: D20455352

fbshipit-source-id: 777dae32a23d0ec75fd7e5e1627426b5a5f81f5a
2020-03-15 00:19:42 -07:00
6c555e1508 Revert D20311699: [pytorch][PR] [C++ API] RNN / GRU / LSTM layer refactoring
Test Plan: revert-hammer

Differential Revision:
D20311699

Original commit changeset: e2b60fc7bac6

fbshipit-source-id: 72f4a762189490998d6b716857eeac053a11742d
2020-03-14 16:18:48 -07:00
84bd71dbd4 Enable threading for XNNPACK ops. (#34547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34547

This enables threading by passing a threadpool to xnnpack ops.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20370553

fbshipit-source-id: 4db08e73f8c69b9e722b0e11a00621c4e229a31a
2020-03-14 12:53:36 -07:00
4da5569300 Pass to remove prepacking ops. (#34319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34319

Removes prepacking ops and installs them as attributes of the top-level
module. Freezing needs to run as the first pass.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20290726

fbshipit-source-id: 633ceaa867ff7d5c8e69bd814c0362018394cb3a
2020-03-14 12:53:31 -07:00
7dd5da2026 JIT pass to insert XNNPACK ops (#34048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34048

Rewrites the graph to insert xnnpack prepack and packed run ops for
conv2d and linear.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20185658

fbshipit-source-id: c4c073c912ad33e822e7beb4ed86c9f895129d55
2020-03-14 12:53:27 -07:00
4c30fc7238 Integrate XNNPACK with custom class for packing weights. (#34047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34047

This PR integrates the added xnnpack conv2d and linear op via
custom class registration for packed weights. The packed struct
is serializable.

Test Plan:
python test test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20185657

fbshipit-source-id: fc7e692d8f913e493b293b02d92f4e78536d7698
2020-03-14 12:51:56 -07:00
e23a9dc140 [C++ API] RNN / GRU / LSTM layer refactoring (#34322)
Summary:
This PR refactors RNN / GRU / LSTM layers in C++ API to exactly match the implementation in Python API.

**BC-breaking changes:**
- Instead of returning `RNNOutput`, RNN / GRU forward method now returns `std::tuple<Tensor, Tensor>`, and LSTM forward method now returns `std::tuple<Tensor, std::tuple<Tensor, Tensor>>`, matching Python API.
- RNN / LSTM / GRU forward method now accepts the same inputs (input tensor and optionally hidden state), matching Python API.
- RNN / LSTM / GRU now has `forward_with_packed_input` method which accepts `PackedSequence` as input and optionally hidden state, matching the `forward(PackedSequence, ...)` variant in Python API.
- In `RNNOptions`
    - `tanh()` / `relu()` / `activation` are removed. Instead, `nonlinearity` is added which takes either `torch::kTanh` or `torch::kReLU`
    - `layers` -> `num_layers`
    - `with_bias` -> `bias`
- In `LSTMOptions`
    - `layers` -> `num_layers`
    - `with_bias` -> `bias`
- In `GRUOptions`
    - `layers` -> `num_layers`
    - `with_bias` -> `bias`

The majority of the changes in this PR focused on refactoring the implementations in `torch/csrc/api/src/nn/modules/rnn.cpp` to match the Python API. RNN tests are then changed to reflect the revised API design.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34322

Differential Revision: D20311699

Pulled By: yf225

fbshipit-source-id: e2b60fc7bac64367a8434647d74c08568a7b28f7
2020-03-14 12:09:04 -07:00
5710374e4e [reland][quant][graphmode] Add quantized conv2d-relu fusion pattern (#33279) (#34744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34744

att

Test Plan: python test/test_jit.py

Differential Revision: D20449667

Pulled By: jerryzh168

fbshipit-source-id: 01bbc26604fac421dcaacaf4fa1b57731f1f08b7
2020-03-14 01:03:18 -07:00
fb20621b3b Move torchbind out of jit namespace (#34745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34745

Test Plan: Imported from OSS

Differential Revision: D20450239

Pulled By: jamesr66a

fbshipit-source-id: 3f5597626f21d7b5e329b57da358c76b531bf806
2020-03-13 23:03:14 -07:00
8a395882ce [quant][onnx] Support conversion of quantized sigmoid operator from pytorch to caffe2 (#34629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34629

Add support for sigmoid in the conversion flow through onnx

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps.test_quantized_sigmoid
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps.test_small_model

Imported from OSS

Differential Revision: D20433680

fbshipit-source-id: 95943e14637d294122e4d102c5c19c06d27064c6
2020-03-13 22:42:06 -07:00
af28915164 [quant][onnx] Add support to convert max_pool2d quantized pytorch op to C2 (#33945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33945

Add mapping for this operator in symbolics

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps.test_max_pool2d

Imported from OSS

Differential Revision: D20433681

fbshipit-source-id: 88f02ade698262a6f8824671830bc1f7d40bbfa6
2020-03-13 22:40:49 -07:00
d041d0784e [C++ API] RNNCell / LSTMCell / GRUCell layers (#34400)
Summary:
This PR adds `RNNCell` / `LSTMCell` / `GRUCell` layers to the C++ frontend, with implementations exactly matching the Python API equivalent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34400

Differential Revision: D20316859

Pulled By: yf225

fbshipit-source-id: bb7cee092622334043c0d0fd0fcb4e75e707699c
2020-03-13 21:52:24 -07:00
68758b2fa0 Add the quantized batch_norm3d and also batch_norm3d fused with relu operators (#34702)
Summary:
As titled; needed for bringing up the quantized video model. The batch_norm_relu test will be added in another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34702

Differential Revision: D20436092

Pulled By: lly-zero-one

fbshipit-source-id: 116bd306f7880bfd763d8575654fbd6c92818338
2020-03-13 20:30:28 -07:00
da11646db1 [C++ API] Link to module options doc for functional that has same options as module (#34752)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34752

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20452681

Pulled By: yf225

fbshipit-source-id: 06b56a08bd480999353ebbff39c035225e4070df
2020-03-13 20:19:43 -07:00
7dee36a061 .circleci: Remove CUDA 10.0, no longer needed (#34726)
Summary:
Since we've added CUDA 10.2, it is time to retire CUDA 10.0

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34726

Differential Revision: D20453081

Pulled By: seemethere

fbshipit-source-id: fd5bb35325a5f1577d0f0404d16cd7dfe34c86ad
2020-03-13 18:55:45 -07:00
52005b551c invokeOperatorFromPython: support overloaded operator calling (#34671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34671

Like the Python arg parser, this tries to convert the arguments to each schema in order.
It introduces schema_match_exception which gets thrown when the schema doesn't match,
allowing the overload handler to try the next option.

Behavior will not 100% match the schema argument parser but should work for
simple cases using custom binding.

Test Plan: Imported from OSS

Differential Revision: D20432206

Pulled By: zdevito

fbshipit-source-id: 280839a2205ea3497db3a9b5741fccc1e2bff9a8
2020-03-13 18:46:03 -07:00
ab76a8206f [JIT][mobile] Support built-in Function call in lite interpreter (#34676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34676

Test Plan: Imported from OSS

Differential Revision: D20427938

Pulled By: jamesr66a

fbshipit-source-id: 79eebfa858776f26da55ffd49d3f78fa7ae0df9b
2020-03-13 18:24:18 -07:00
af3a7e2b50 [jit] small cleanups after script:: removal (#34677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34677

1. Remove remaining uses of `script::` namespace from the codebase,
2. Add one more typedef for `script::ExtraFilesMap` which is part of the
public interface.

Pull Request resolved: #34580

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D20431739

Pulled By: suo

fbshipit-source-id: a29d369c755b6506c53447ca1f286b6339222c9a
2020-03-13 17:56:16 -07:00
e7910aa9e5 [fix] use non-inplace for insert observer pass (#34190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34190

inplace modification of ClassType might affect other tests, so we want to do non-inplace modifications.
Actually the inplace argument will be removed soon.

Test Plan:
ci

Imported from OSS

Differential Revision: D20451765

fbshipit-source-id: e87ad528c4e7f84f5774b94a8e3e85568269682d
2020-03-13 17:25:07 -07:00
1734bd6871 skip mask_rcnn test (#34734)
Summary:
fix master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34734

Differential Revision: D20447607

Pulled By: eellison

fbshipit-source-id: 165c64f0484abf068b7d3a204a6bcb623ffe0910
2020-03-13 15:50:49 -07:00
6d790c3611 Mark PyTorch incompatible with python-3.6.0 (#34724)
Summary:
Per https://github.com/pytorch/pytorch/issues/19161, PyTorch is incompatible with Python 3.6.0 due to the missing `PySlice_Unpack`.
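
In effect, the requirement becomes the following (an illustrative check, not the actual packaging metadata):

```python
import sys

# PySlice_Unpack first appeared in CPython 3.6.1, so 3.6.0 is rejected.
if sys.version_info < (3, 6, 1):
    raise RuntimeError("PyTorch requires Python >= 3.6.1")
```
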
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34724

Test Plan: CI + try to load pytorch binary using python-3.6.0

Differential Revision: D20449052

Pulled By: malfet

fbshipit-source-id: 2c787fc64f5d1377c7f935ad2f3c77f46723d7dd
2020-03-13 15:22:34 -07:00
aedffdf7d8 Support for Tensor Shape Type Hint (#34595)
Summary:
This PR is related to [https://github.com/pytorch/pytorch/issues/33953](https://github.com/pytorch/pytorch/issues/33953).
I've created a directory `type_hint_tests` for the example as suggested by zou3519 [here](https://github.com/pytorch/pytorch/issues/33953#issuecomment-597716405). This directory is supposed to contain examples over which mypy will run. I've added the test in `test/test_type_hints.py`.
The test can simply be invoked by
```
$ python3 test/test_type_hints.py
Fail to import hypothesis in common_utils, tests are not derandomized
.b'test/type_hint_tests/size.py:7: error: Tuple index out of range\ntest/type_hint_tests/size.py:8: error: Tuple index out of range\n'
.
----------------------------------------------------------------------
Ran 2 tests in 13.660s

OK

```
Note that I have intentionally not fixed the stub yet, to show that the test works. The issue can be fixed by changing the definition of Size to `class Size(Tuple[_int, ...]): ...` in `/torch/__init__.pyi.in`.
After changing the `Size` definition, the test passes.
```
$ python3 test/test_type_hints.py
Fail to import hypothesis in common_utils, tests are not derandomized
.b''
.
----------------------------------------------------------------------
Ran 2 tests in 19.382s

OK
```
I will do that once I get approval from zou3519. This is an initial implementation; please provide your suggestions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34595

Differential Revision: D20441817

Pulled By: zou3519

fbshipit-source-id: 00a434adf5bca813960f4efea38aa6d6953fe85f
2020-03-13 15:16:24 -07:00
c9ed111894 [caffe2][quantization] Add initializer and precision as read-only property to QueryTensorQparam (#34706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34706

as title

Test Plan: test in stacked diff

Reviewed By: csummersea

Differential Revision: D20436618

fbshipit-source-id: e51ef0a22708425cd296c05f4089fe8c98eda90a
2020-03-13 15:09:35 -07:00
c371c3aba7 [rpc][profiler] add a test case to verify record_function context manager works (#34511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34511

With https://github.com/pytorch/pytorch/pull/34122/files, issues
with using the `record_function` context manager while profiling RPCs were fixed.
This adds a test case to verify that we can use RPC with the `record_function`
context manager.
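
For reference, a minimal local sketch of the `record_function` context manager itself (without the RPC plumbing this test adds):

```python
import torch
from torch.autograd import profiler

with profiler.profile() as prof:
    with profiler.record_function("my_block"):  # illustrative label
        torch.randn(100).sum()

print(prof.key_averages().table())
```
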
ghstack-source-id: 100109932

Test Plan: Unit test change

Differential Revision: D20352242

fbshipit-source-id: d6429e4352ad3b8d874dc0f27b23ecb6202e6b2b
2020-03-13 15:03:30 -07:00
0f3b6f3dec Add min function to cuda math compat (#34723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34723

Add min function to cuda math compat

Test Plan: unittest

Reviewed By: houseroad

Differential Revision: D20444517

fbshipit-source-id: 1a93343cc57249ef1101eeb7ef373266f6a2873a
2020-03-13 14:31:09 -07:00
a730abd997 [PyTorch][tools] Add linux64 clang-format hash
Summary:
This commit adds a reference hash for the linux64 clang-format binary and in
doing so, enables this script to be used on Linux machines.

Test Plan:
Ran the script.

```
meghanl@devvm1517:caffe2  (ff25240c|remote/master)$ export http_proxy=fwdproxy:8080
meghanl@devvm1517:caffe2  (ff25240c|remote/master)$ export https_proxy=fwdproxy:8080
meghanl@devvm1517:caffe2  (ff25240c|remote/master)$ python3 ./tools/clang_format_new.py --diff
Downloading clang-format to /data/users/meghanl/fbsource/fbcode/caffe2/.clang-format-bin
0% |################################################################| 100%
Using clang-format located at /data/users/meghanl/fbsource/fbcode/caffe2/.clang-format-bin/clang-format
meghanl@devvm1517:caffe2  (ff25240c|remote/master)$ echo $?
1
```
A non-zero return code indicates that `clang-format` will make changes.

Reviewed By: suo

Differential Revision: D20434291

fbshipit-source-id: fa13766e9d94720d4b0d8a540d2f1507e788f7a5
2020-03-13 14:22:17 -07:00
f933fa3613 [docs][1.5] update RPC docs to reflect correct use of dist_autograd backwards and dist_optim step() (#34670)
Summary:
- Clarify that `torch.distributed.autograd.backward()` does not use the current thread-local autograd context; instead, it looks the context up based on the `context_id` passed in
- Clarify the same for `torch.distributed.optim.DistributedOptimizer.step()` (see the sketch below)
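
A minimal sketch of the documented pattern; the RPC setup, the peer name, and the loss function are illustrative assumptions:

```python
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc(...) has run, a peer named "worker1" exists, and
# `optimizer` is a torch.distributed.optim.DistributedOptimizer.
def train_step(optimizer, compute_loss, inputs):
    with dist_autograd.context() as context_id:
        loss = rpc.rpc_sync("worker1", compute_loss, args=(inputs,))
        # backward() looks the autograd context up by context_id;
        # it does not read thread-local state.
        dist_autograd.backward(context_id, [loss])
        # step() needs the same context_id to find the gradients
        # accumulated in that context.
        optimizer.step(context_id)
```
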
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34670

Differential Revision: D20427645

Pulled By: rohan-varma

fbshipit-source-id: a1a88de346cdd4dbe65fb2b7627157f86fd2b6a3
2020-03-13 14:09:23 -07:00
c9023e3b12 Support left and right shift operators in JIT (#34563)
Summary:
With this PR, we can now support left and right shift operators in the JIT engine for <int, int> and <Tensor, int>.

Updated tests pass as expected:
```
> python test/test_jit.py
...
Ran 2427 tests in 84.861s

OK (skipped=139, expected failures=1)
```

Running the following code with Python results in the output below:
```
> cat ~/expressions.py
import torch

@torch.jit.script
def fn(a, b):
    # type: (int, int)
    return (
        a << b,  # supported
        b >> a,  # supported
        a & b,
        a | b,
        a ^ b
    )
print(fn.graph)
```

```
> python ~/expressions.py
graph(%a.1 : int,
      %b.1 : int):
  %4 : int = aten::leftshift(%a.1, %b.1) # /home/ince/expressions.py:7:8
  %7 : int = aten::rightshift(%b.1, %a.1) # /home/ince/expressions.py:8:8
  %10 : int = aten::__and__(%a.1, %b.1) # /home/ince/expressions.py:9:8
  %13 : int = aten::__or__(%a.1, %b.1) # /home/ince/expressions.py:10:8
  %16 : int = aten::__xor__(%a.1, %b.1) # /home/ince/expressions.py:11:8
  %17 : (int, int, int, int, int) = prim::TupleConstruct(%4, %7, %10, %13, %16)
  return (%17)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34563

Differential Revision: D20434209

Pulled By: tugrulince

fbshipit-source-id: 886386c59755106e17b84778b8e495b80a6269cd
2020-03-13 13:00:33 -07:00
c34ee4fb6e [JIT] disable test (#34722)
Summary:
I opened https://github.com/pytorch/pytorch/issues/34658 but it didn't work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34722

Differential Revision: D20444547

Pulled By: eellison

fbshipit-source-id: 90aa06098587b48c9760a9c6df9bec01d642fcdb
2020-03-13 12:48:27 -07:00
027d7f7ba5 Delete AT_WARN and replace all AT_WARN with TORCH_WARN (#34623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34623

The bandaid of "AT_WARN" keeps introducing new warnings. Let's get rid
of it entirely.

Close #34502

Test Plan: Imported from OSS

Differential Revision: D20420112

Pulled By: albanD

fbshipit-source-id: 7160c113cb4deb2d2f50a375356f423fe5e86f50
2020-03-13 12:27:22 -07:00
4a599f47fb scripts: Add script to promote conda packages (#34659)
Summary:
How this actually works:
  1. Gets a list of URLs from anaconda for packages to download, most
  likely from pytorch-test
  2. Download all of those packages locally in a temp directory
  3. Upload all of those packages, with a dry run upload by default

This, along with https://github.com/pytorch/pytorch/issues/34500 basically completes the scripting work for the eventual promotion pipeline.

Currently testing with:
```
TEST_WITHOUT_GIT_TAG=1 TEST_PYTORCH_PROMOTE_VERSION=1.4.0 PYTORCH_CONDA_FROM=pytorch scripts/release/promote/conda_to_conda.sh
```

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34659

Differential Revision: D20432687

Pulled By: seemethere

fbshipit-source-id: c2a99f6cbc6a7448e83e666cde11d6875aeb878e
2020-03-13 12:14:58 -07:00
b1dbe33056 Skip TestNN.test_spectral_norm_load_state_ if PyTorch is compiled w… (#34686)
Summary:
…ithout lapack

LAPACK is needed for `at::svd`, which is called from `pinverse()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34686

Test Plan: CI + local run

Differential Revision: D20442637

Pulled By: malfet

fbshipit-source-id: b3531ecc1197b0745ddcf50febb7fb4a7700d612
2020-03-13 11:36:33 -07:00
40eff454ce Fix max_pool2d NHWC for large tensors; fix incorrect use of cudaGetLastError() (#34519)
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/33988 and fix https://github.com/pytorch/pytorch/issues/34083.

Previously, the max_pool2d_nhwc kernels used shared memory with size proportional to the tensor size (c \* h \* w). When the tensor size is too large, the kernel launch fails.

This PR follows the guidance in AdaptiveAvgPool2d_nhwc by increasing the number of grid_x with a split in the "C" dimension. With that change, there is an upper limit on the shared memory size (less than 48 KB) regardless of tensor size.

A benchmark can be found at [here](0b98146089/max-pool2d/max-pool2d.ipynb). TL;DR barely any performance drop is found.

cc csarofeen ptrblck jjsjann123 VitalyFedyunin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34519

Differential Revision: D20388848

Pulled By: VitalyFedyunin

fbshipit-source-id: 9454f385f9315afaab4a05303305578bbcd80b87
2020-03-13 11:28:49 -07:00
3924c55f4c [C++ API] Update torch::nn functional docs (#34688)
Summary:
- `torch::nn::functional` functions must provide example for how to use the corresponding functional options
- `torch::nn::functional` functions must link to the corresponding functional options
- remove `TORCH_NN_FUNCTIONAL_USE_MODULE_OPTIONS` macro, and put `torch::nn::functional` options docs inside the functional namespace, right above functional declaration
- `torch::nn::functional` options docs should not link back to torch::nn layers. Instead, they should  have links to `torch::nn::functional::xxx`

----

This PR is BC-breaking in the following way:
`TORCH_NN_FUNCTIONAL_USE_MODULE_OPTIONS` macro is removed, and user should explicitly write
```cpp
namespace functional {
using SomeFuncOptions = SomeModuleOptions;
} // namespace functional
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34688

Differential Revision: D20431251

Pulled By: yf225

fbshipit-source-id: 7d4f27dca3aad2a1e523690927d7afb261b9d308
2020-03-13 10:27:28 -07:00
27410318ad [PyTorch][Mobile] Fix the operator latency issue.
Summary: The last diff enabled operator stats for non-production builds, including AIBench. But the operator latency is off (https://our.intern.facebook.com/intern/aibench/details/414567479798816): it actually represents the operator execution end time, because threadLocalDebugInfo was not set and the start time is therefore 0. This diff fixes that by creating a new ThreadLocalDebugInfo object when the op starts to run and storing the model information there for logging.

Test Plan:
```buck run mode/mac aibench:run_bench_macos -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform android --framework pytorch --remote --devices SM-G960F-8.0.0-26```
https://our.intern.facebook.com/intern/aibench/details/922804117425407

```buck run mode/mac aibench:run_bench_macos -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android --framework pytorch --remote --devices SM-G960F-8.0.0-26```
https://our.intern.facebook.com/intern/aibench/details/593403202250750

Reviewed By: xta0

Differential Revision: D20436388

fbshipit-source-id: 740bc94c3f51daef6af9b45c1ed7a708f5fc8836
2020-03-13 09:49:54 -07:00
8e8a37d746 Fix bug in baddbmm corner case (#33467) (#33538)
Summary:
Ensure `torch.baddbmm(c, a, b)` returns `beta*c` when `a @ b` has empty inner dimension.
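
A small check of the corner case, with illustrative shapes:

```python
import torch

a = torch.randn(2, 3, 0)   # inner dimension is empty
b = torch.randn(2, 0, 4)
c = torch.randn(2, 3, 4)

out = torch.baddbmm(c, a, b, beta=2.0)
# a @ b is all zeros here, so the result must reduce to beta * c.
assert torch.allclose(out, 2.0 * c)
```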

Fixes https://github.com/pytorch/pytorch/issues/33467.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33538

Differential Revision: D20352352

Pulled By: albanD

fbshipit-source-id: a7021c1979f82402ecea4784d6cc39783392ea16
2020-03-13 09:30:20 -07:00
8f854fb9e2 [1/n][multi-tower] add partition info in predictor construction (#34175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34175

to incorporate PartitionInfo added in D20015493

Test Plan: unit tests

Reviewed By: yinghai

Differential Revision: D20133759

fbshipit-source-id: 130db2d80bca3c05a7ec91292159f857046718e0
2020-03-13 09:23:39 -07:00
14c1ab049d [Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D20415422

fbshipit-source-id: 860f8dd9dce0a2420792bafb7d3e58bd883ab7e4
2020-03-13 06:27:03 -07:00
b93518a662 Revert D20422879: [pytorch][PR] Remove hotpatches that circumvent MAGMA bug
Test Plan: revert-hammer

Differential Revision:
D20422879

Original commit changeset: 8dd7a30b5c31

fbshipit-source-id: a44dda3220d426a92b0e158e9903566be8701374
2020-03-13 06:00:11 -07:00
6791ae51a5 Updating submodules
Summary:
GitHub commits:

e8f09733c7
7e1606a407
674cf41732
e961892c6c
a5dffd2784

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: eb2e20f65ba40bacbfeb1d0cb54ed373cca564ff
2020-03-13 04:17:59 -07:00
fd35596585 [docs][1.5] Update distributed autograd note (#34657)
Summary:
- Update API calls `backward` and `optim.step` now that we require `context_id`
- Add notes to clarify purpose of distributed autograd context (this was a source of confusion in some feedback)
- Add note that details why optimizer requires context_id
- Clearly specify that we don't have SMART mode yet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34657

Differential Revision: D20427667

Pulled By: rohan-varma

fbshipit-source-id: 5f8a3539ccf648a78e9e9a0dfdfe389c678b1606
2020-03-12 22:56:32 -07:00
808f84ee35 [Shape Inference] Update shape inference in dper3 backend - C2 part (#34474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34474

Add InferQuantization - set current_dim_type_ to CONSTANT for quantization ops.

Test Plan: buck test mode/opt-clang caffe2/caffe2/opt:bound_shape_inference_test

Reviewed By: yinghai

Differential Revision: D20332703

fbshipit-source-id: 36fa9bc81ae9f49dd00d8393d99ccce0884542df
2020-03-12 22:20:51 -07:00
ad4bc8c9b8 Best-effort Error Detection for Using Deleted UserRRefs (#34673)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34673

Test Plan: Imported from OSS

Differential Revision: D20427839

Pulled By: mrshenli

fbshipit-source-id: b1b12ca42a9ed5294806c53fa7d6f54e7dc8b188
2020-03-12 21:39:15 -07:00
f9aa0c870f Use c10::str in py_rref.cpp (#34681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34681

Test Plan: Imported from OSS

Differential Revision: D20428827

Pulled By: mrshenli

fbshipit-source-id: 847486b3114f0e9a2ad5f80c5e44db82d977c6a2
2020-03-12 21:39:10 -07:00
673d56c838 Use c10::str in process_group_agent.cpp (#34679)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34679

Test Plan: Imported from OSS

Differential Revision: D20428467

Pulled By: mrshenli

fbshipit-source-id: 2bfde4e383347c6e709109f074f55b9bc8068a49
2020-03-12 21:38:14 -07:00
e9a660a160 Revert D20354878: [quant][graphmode] Add quantized conv2d-relu fusion pattern
Test Plan: revert-hammer

Differential Revision:
D20354878

Original commit changeset: 2b19797d4b3f

fbshipit-source-id: 18f447074794af0d579e145df02af47d01746921
2020-03-12 21:29:08 -07:00
5d65b5cd01 Add the 3d upsample quantized op for video model (#34594)
Summary:
As titled: we are currently missing this 3d op, which is required for video-related models.

Performance benchmark:
```
import torch, time

for dtype in [torch.qint8, torch.quint8, torch.qint32]:
    print('****', str(dtype), '*****')
    x = torch.rand(1, 56, 64, 56, 256)

    q_x = torch.quantize_per_tensor(x, 0.5, 1, dtype)
    q_x = q_x.permute([0, 4, 1, 2, 3])

    x = x.permute([0, 4, 1, 2, 3])

    NITER = 100

    s = time.time()
    for i in range(NITER):
        float_out = torch.nn.functional.interpolate(x, size=30, scale_factor=None, mode="nearest", align_corners=None)
    time_per_iter_float = (time.time() - s) / NITER

    s = time.time()
    for i in range(NITER):
        quant_out = torch.nn.functional.interpolate(q_x, size=30, scale_factor=None, mode="nearest", align_corners=None)
    time_per_iter_quant = (time.time() - s) / NITER

    ref_quantized = torch.quantize_per_tensor(float_out, 0.5, 1, dtype)
    torch.testing.assert_allclose(ref_quantized.dequantize(), quant_out.dequantize())

    print('time/iter ms (float)', 'time/iter ms (quant)', 'quant/float', sep='\t')
    print(time_per_iter_float * 1000, time_per_iter_quant * 1000, time_per_iter_quant / time_per_iter_float, sep='\t')

    bytes_float = (x.numel() + float_out.numel()) * x.element_size()
    bytes_quant = (q_x.numel() + quant_out.numel()) * q_x.element_size()

    float_bw_gbps = bytes_float / time_per_iter_float / 1e9
    quant_bw_gbps = bytes_quant / time_per_iter_quant / 1e9

    print('GB/s float', 'GB/s quant', sep='\t')
    print(float_bw_gbps, quant_bw_gbps, sep='\t')
```

```
**** torch.qint8 *****
time/iter ms (float)  time/iter ms (quant)  quant/float
1136.8209528923035  1.294245719909668 0.0011384780660638283
GB/s float  GB/s quant
0.20510608588517917 45.03953391792442
**** torch.quint8 *****
time/iter ms (float)  time/iter ms (quant)  quant/float
827.9890131950378 1.11464262008667  0.0013462046021426
GB/s float  GB/s quant
0.28160868355034036 52.29678369508914
**** torch.qint32 *****
time/iter ms (float)  time/iter ms (quant)  quant/float
834.6958303451538 7.481417655944824 0.008963046638020456
GB/s float  GB/s quant
0.2793459455806586  31.16640544920269
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34594

Differential Revision: D20389106

Pulled By: lly-zero-one

fbshipit-source-id: d3a8c2cac58087d8b29e9cae64822f5b2d4c03ba
2020-03-12 21:06:38 -07:00
d5f8c8f3ba Revert D20121169: [pytorch][PR] ONNX Export Support for CrossEntropyLoss
Test Plan: revert-hammer

Differential Revision:
D20121169

Original commit changeset: 7b56617e8c60

fbshipit-source-id: d7f302d1e54f3c978c3be0a0ad1ee600790a5b27
2020-03-12 20:30:54 -07:00
4ae74b3b25 [DPER3][Shape Inference] Initial Shape Inference in DPER3 frontend (#33607)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33607

Differential Revision: D20025048

fbshipit-source-id: 8b3a3bcfeb450de4d38c555bf2bb116ddedad3ec
2020-03-12 20:25:50 -07:00
0ff4d37933 [quant][graphmode] Add quantized conv2d-relu fusion pattern (#33279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33279

att

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20354878

fbshipit-source-id: 2b19797d4b3fd96918164a58bfbd768211ad6c6d
2020-03-12 19:49:57 -07:00
44256199a9 [JIT] remove specialized list ops (#34520)
Summary:
Now that lists are no longer specialized, we can register only one operator for list ops that are generic to their element type.
This PR reorgs lists into three sets of ops:
- CREATE_GENERIC_LIST_OPS
- CREATE_SPECIALIZED_LIST_OPS
- CREATE_COMPARATOR_LIST_OPS_SPECIALIZED (we didn't bind certain specialized ops to Tensor)

This is important to land quickly because mobile is finalizing its bytecode soon, after which we could not remove these ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34520

Reviewed By: iseeyuan

Differential Revision: D20429775

Pulled By: eellison

fbshipit-source-id: ae6519f9b0f731eaa2bf4ac20736317d0a66b8a0
2020-03-12 17:49:23 -07:00
c78eacb5ee scripts: Add promotion script for s3 to pypi (#34500)
Summary:
This relies on the s3-to-s3 promotion scripts having already run.

A continuation of the work done in https://github.com/pytorch/pytorch/issues/34274

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34500

Test Plan: yeah_sandcastle

Differential Revision: D20389101

Pulled By: seemethere

fbshipit-source-id: 5e5b554cff964630c5414d48be35f14ba6894021
2020-03-12 17:21:23 -07:00
52787388d2 [tools] Add clang_format_new.py to download, verify and run clang-format binary (#34566)
Summary:
**Summary**
This commit adds `tools/clang_format_new.py`, which downloads a platform-appropriate
clang-format binary to a `.gitignored` location, verifies the binary by comparing its
SHA1 hash to a reference hash (also included in this commit), and runs it on all files
matching a specific regex in a list of whitelisted subdirectories of pytorch.

This script will eventually replace `tools/clang_format.py`.

**Testing**
Ran the script.

*No Args*
```
pytorch > ./tools/clang_format.py
Downloading clang-format to /Users/<user>/Desktop/pytorch/.clang-format-bin
0% |################################################################| 100%
Using clang-format located at /Users/<user>/Desktop/pytorch/.clang-format-bin/clang-format
> echo $?
0
> git status
<bunch of files>
```

`--diff` *mode*
```
> ./tools/clang_format.py --diff
Using clang-format located at /Users/<user>/Desktop/pytorch/.clang-format-bin/clang-format
Some files are not formatted correctly
> echo $?
1

<format files using the script>

> ./tools/clang_format.py --diff
Using clang-format located at /Users/<user>/Desktop/pytorch/.clang-format-bin/clang-format
All files are formatted correctly
> echo $?
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34566

Differential Revision: D20431290

Pulled By: SplitInfinity

fbshipit-source-id: 3966f769cfb923e58ead9376d85e97127415bdc6
2020-03-12 17:08:54 -07:00
90ca7a1feb [quant][graphmode] Add Finalize function that inlines graph and produce quantized ops (#33927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33927

Test Plan:
test will be added in later PRs

Imported from OSS

Differential Revision: D20354879

fbshipit-source-id: 03976f4b86c46dbdc4e45764a1e72f1a3855a404
2020-03-12 14:52:58 -07:00
9f05fc9322 [Aten] First argument of check_names_valid_for() should be an unsigned value (#34158)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34158

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D20232089

fbshipit-source-id: d74b5e36a139998e6967b7b6339001c49d9d58e8
2020-03-12 13:46:37 -07:00
721bd11cc3 [caffe2] Refactor out common util functions from tvm_transformer (#34652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34652

Split from D20006007 because it needs to be synced to open source and also for easy testing & landing.

Test Plan:
```
buck test caffe2/caffe2/fb/tvm:test_tvm_transform
```
CI

Reviewed By: yinghai

Differential Revision: D20414037

fbshipit-source-id: 6e17dd9f8cffe87bc59c6e3cc6fd1f8d8def926b
2020-03-12 13:30:15 -07:00
787c307e63 Revert D20368543: [pytorch][PR] [JIT] remove specialized list ops
Test Plan: revert-hammer

Differential Revision:
D20368543

Original commit changeset: ad0c6d70d2a6

fbshipit-source-id: b8b1a64ac830d5f544567714b940c57274194d3f
2020-03-12 12:55:49 -07:00
8c332ff84f [JIT] EliminateDeadCode shouldn't remove custom operator node that has untracked mutation (#34635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34635

For custom ops, the node was removed in the EliminateDeadCode IR optimization step, causing wrong training results.

EliminateDeadCode decides to remove it because it has no outputs (so nothing uses its output), it has no side effects, and it supposedly has no untracked mutation. The last assumption is not true: custom ops can have untracked mutation.

The if statement here only allows aten and prim operators to have untracked mutation, which should be removed.
ghstack-source-id: 100001319

Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_jit

buck build mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_jit \
&& buck-out/gen/caffe2/torch/fb/distributed/pytorch/tests/test_jit\#binary.par -r test_use_dense_adagrad_step
```

Reviewed By: wanchaol

Differential Revision: D7440221

fbshipit-source-id: e424417ab397d90075884c7050c59dfc5c84cf77
2020-03-12 12:37:32 -07:00
fe9b4e3cba [DPER3] Blob Reorder (#33579)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33579

Differential Revision: D20008865

fbshipit-source-id: f35aded311d9d1d7d438d828ccabd2bab5575e5c
2020-03-12 12:28:12 -07:00
9e6cd98c3f Ensure torch_cuda is linked against on Windows (#34288)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31611.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34288

Differential Revision: D20314251

Pulled By: seemethere

fbshipit-source-id: 15ab2d4de665d553a1622a2d366148697deb6c02
2020-03-12 12:16:44 -07:00
31cd893899 remove some TH dead code (#34644)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34644

Test Plan: Imported from OSS

Differential Revision: D20423063

Pulled By: ngimel

fbshipit-source-id: 2783345ea9b3ed65e51a7d0e17cfa29f2c12cc43
2020-03-12 12:10:32 -07:00
cb06cb7b9f Remove hotpatches that circumvent MAGMA bug (#34357)
Summary:
Changelog:
- The MAGMA implementation of LU factorization for small singular square batched matrices had a bug that resulted in NaN values. This has been fixed in MAGMA 2.5.2. This PR removes the existing patch that was a temporary workaround for this bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34357

Test Plan: - Existing tests for det and lu should pass

Differential Revision: D20422879

Pulled By: seemethere

fbshipit-source-id: 8dd7a30b5c31fc5b844e0a11965efd46067e936a
2020-03-12 11:59:23 -07:00
a74fbea345 Continuous bernoulli distribution (take 2) (#34619)
Summary:
We recently had a NeurIPS paper (https://arxiv.org/abs/1907.06845 and https://papers.nips.cc/paper/9484-the-continuous-bernoulli-fixing-a-pervasive-error-in-variational-autoencoders) where we introduce a new [0,1]-supported distribution: the continuous Bernoulli. This pull request implements this distribution in pytorch.
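
A short usage sketch of the new distribution:

```python
import torch
from torch.distributions import ContinuousBernoulli

dist = ContinuousBernoulli(probs=torch.tensor(0.3))
x = dist.sample((5,))   # samples lie in [0, 1]
lp = dist.log_prob(x)   # properly normalized log-density
print(x, lp, dist.mean)
```
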
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34619

Differential Revision: D20403123

Pulled By: ngimel

fbshipit-source-id: d807c7d0d372c6daf6cb6ef09df178bc7491abb2
2020-03-12 11:53:18 -07:00
944ea4c334 ONNX Export Support for CrossEntropyLoss (#33767)
Summary:
Add ONNX export support for torch.nn.CrossEntropyLoss.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33767

Reviewed By: hl475

Differential Revision: D20121169

Pulled By: houseroad

fbshipit-source-id: 7b56617e8c60617b922949fc8b4ecc626eedf7ed
2020-03-12 11:46:58 -07:00
352e9b11e0 Attempt to resolve inconsistent dll linkage warnings on MSVC (#34639)
Summary:
Continue the work in https://github.com/pytorch/pytorch/pull/19242.
Remove the template declarations that implies different dll linkage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34639

Differential Revision: D20419400

Pulled By: ezyang

fbshipit-source-id: 5c7c30f0a4c3ba555589629f352ddb1c006c0c54
2020-03-12 11:41:02 -07:00
fff6fe83a7 [pytorch-rpc] WireSerializer should check has_storage() (#34626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34626

We need to check has_storage() before looking at it in
cloneSparseTensors(), to avoid gratuitously throwing.

Ideally, we'd add a test for this (I wrote one up but had to disable it),
but it won't work until the JIT Pickler supports sparse tensors.
ghstack-source-id: 100018077

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcAgent/...

Differential Revision: D20399971

fbshipit-source-id: 5debfa8140eb1f949d37336330223962cc320abc
2020-03-12 11:35:21 -07:00
2f32b92763 [ROCm] Enable BFloat16 type for EmbeddingBag ops et al (#34630)
Summary:
This PR enables bfloat16 type for

- Embedding, Index, Sigmoid Ops used in [DLRM](https://github.com/facebookresearch/dlrm)
- Miscellaneous ops like comparison ops, arange op used in unit tests
- Rename types list with the pattern `*_with_bfloat16` in `test_torch.py` to avoid confusion

iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34630

Differential Revision: D20405093

Pulled By: ezyang

fbshipit-source-id: aa9538acf81b3a5a9a46ce5014529707fdf25687
2020-03-12 11:30:33 -07:00
1e6c47413a Updating submodules
Summary:
GitHub commits:

87f3feae5a
cd6c8897f5

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 0c961541c715da74ae417ad25bf29f48e74e45d1
2020-03-12 11:23:39 -07:00
d81d65b2f7 Add entry for distributed tests to CODEOWNERS. (#34637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34637

ghstack-source-id: 100003837

Test Plan: waitforbuildbot

Differential Revision: D20404552

fbshipit-source-id: a7f35beb8b78ad25e5cd000cd940dd7e94cc65de
2020-03-12 11:17:51 -07:00
f9f8424386 [JIT] remove specialized list ops (#34520)
Summary:
Now that lists are no longer specialized, we can register only one operator for list ops that are generic to their element type.
This PR reorgs lists into three sets of ops:
- CREATE_GENERIC_LIST_OPS
- CREATE_SPECIALIZED_LIST_OPS
- CREATE_COMPARATOR_LIST_OPS_SPECIALIZED (we didn't bind certain specialized ops to Tensor)

This is important to land quickly because mobile is finalizing its bytecode soon, after which we could not remove these ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34520

Differential Revision: D20368543

Pulled By: eellison

fbshipit-source-id: ad0c6d70d2a6be6ff0e948d6786052167fc43e27
2020-03-12 10:48:14 -07:00
3f1ba3c465 Redo of "Add API for listing functions overridable by __torch_function__" (#34240)
Summary:
This is a redo of https://github.com/pytorch/pytorch/pull/33791, which was reverted because it introduced a flaky test. The test was flaky, and only on Python 3.5, because of dict order randomization.

I've fixed the issue with tests clobbering each other in b539fec, and in e0d7402 removed the override tests for `torch.nn.functional.tanh` and `torch.nn.functional.sigmoid`, which are deprecated and shouldn't be overridable. I also verified that no more test clobbering is happening.
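
A sketch of the listing API this reland restores; the module path is an assumption based on the 1.5-era layout, where it lived in the private `torch._overrides`:

```python
import torch
# Assumed 1.5-era private location of the API.
from torch._overrides import get_overridable_functions

funcs = get_overridable_functions()  # dict: namespace -> list of functions
print(len(funcs[torch]), len(funcs[torch.nn.functional]))
```
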
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34240

Differential Revision: D20252442

Pulled By: cpuhrsch

fbshipit-source-id: 069568e342a41c90e1dc76cbf85ba4aed47f24be
2020-03-12 10:33:17 -07:00
4e07c35679 Delete all user forks tracked in RRefContext before graceful shutting down (#31893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31893

In order to resolve the issue summarized in https://github.com/pytorch/pytorch/issues/31325.

The overall solution is to proactively send out delete-fork messages from user nodes, before user nodes detect RRef leaks.

As the first step, we want to have a weak ref tracker to track all user rrefs.
ghstack-source-id: 100023142

Test Plan:
V22 is the version that makes the user wait on the delete UserRRef message.

# Unit tests

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_nested_rref_stress --stress-runs 100

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_nested_rref_stress

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par - r test_rref_forward_chain

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_non_garbage_collected_user_rref_due_to_local_circular_dependency
```

Reviewed By: mrshenli

Differential Revision: D19292254

fbshipit-source-id: 92c3e8d0b00f183c5e22f163bdca482cc25a1ce9
2020-03-12 10:23:08 -07:00
dd313f314e Stop creating unnecessary Storage with newWithStorage1d. (#34389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34389

Test Plan: Imported from OSS

Differential Revision: D20311060

Pulled By: gchanan

fbshipit-source-id: 6d681e0a78e3ea3982d11cfd2eedca843f48302a
2020-03-12 10:18:28 -07:00
518e9f94c2 Kill newWithStorage. (#34388)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34388

Test Plan: Imported from OSS

Differential Revision: D20311059

Pulled By: gchanan

fbshipit-source-id: 4619a99c7bea76b54b7938b798eedc5bc2983dd5
2020-03-12 10:18:23 -07:00
9fd08b9c37 Get rid of newWithSize. (#34387)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34387

Test Plan: Imported from OSS

Differential Revision: D20311058

Pulled By: gchanan

fbshipit-source-id: b62653fd31a181d06aa73cda68abe75614cea0a9
2020-03-12 10:17:15 -07:00
a54416d208 [C++ API] Remove deprecated torch::nn::BatchNorm / FeatureDropout / modules_ordered_dict and torch::nn::init::Nonlinearity / FanMode (#34508)
Summary:
This PR is BC-breaking in the following way:
- The deprecated `torch::nn::BatchNorm` is removed in favor of `torch::nn::BatchNorm{1,2,3}d`
- The deprecated `torch::nn::FeatureDropout` is removed in favor of `torch::nn::Dropout{2,3}d`
- The deprecated `torch::nn::modules_ordered_dict` is removed. User should do `Sequential sequential({{"m1", MyModule(1)}, {"m2", MyModule(2)}})` instead.
- The deprecated `torch::nn::init::Nonlinearity` is removed, in favor of the following enums:
    - `torch::kLinear`
    - `torch::kConv1D`
    - `torch::kConv2D`
    - `torch::kConv3D`
    - `torch::kConvTranspose1D`
    - `torch::kConvTranspose2D`
    - `torch::kConvTranspose3D`
    - `torch::kSigmoid`
    - `torch::kTanh`
    - `torch::kReLU`
    - `torch::kLeakyReLU`
- The deprecated `torch::nn::init::FanMode` is removed, in favor of the following enums:
    - `torch::kFanIn`
    - `torch::kFanOut`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34508

Differential Revision: D20351601

Pulled By: yf225

fbshipit-source-id: cca0cd112f29a31bb023e348ca8f82780e42bea3
2020-03-12 10:09:58 -07:00
e95657b87e [C++ API] AdaptiveLogSoftmaxWithLoss (#29076)
Summary:
Implemented AdaptiveLogSoftmaxWithLoss and some tests for modules. Reference https://github.com/pytorch/pytorch/issues/25883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29076

Differential Revision: D20404588

Pulled By: yf225

fbshipit-source-id: edbadf432b8173cbcc6caf83c9c03dd92dc31a37
2020-03-12 09:53:58 -07:00
157d2d7825 Fix version check for grad_fn for views (#34145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34145

This fixes the following behavior:
```python
import torch

class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp, inplace):
        view = inp.clone()[:3]
        if inplace:
            view += 2
        return view

    @staticmethod
    def backward(ctx, grad):
        return grad, None

base = torch.rand(10, requires_grad=True)
foo = MyFn.apply(base, False)

print(foo.grad_fn)
# <torch.autograd.function.MyFnBackward object at 0x7f5fd28c4d18>

foo = MyFn.apply(base, True)

print(foo.grad_fn)
# <AsStridedBackward object at 0x7f601c0c3cf0>
```

Where both should be printing `MyFnBackward`.

Test Plan: Imported from OSS

Differential Revision: D20229907

Pulled By: albanD

fbshipit-source-id: 5ebd315d459023017d51760c5bafe43acd5fc3e2
2020-03-12 09:47:56 -07:00
43c9cc7a9c add quantized ELU activation (#34267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34267

Adds quantized ELU.
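
A usage sketch; the functional entry point and its output scale/zero-point arguments are assumed here, so see the test below for the authoritative signature:

```python
import torch
import torch.nn.quantized.functional as qF

x = torch.randn(4)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128,
                               dtype=torch.quint8)
# Assumed signature: elu(input, scale, zero_point, alpha=1.)
qy = qF.elu(qx, scale=0.05, zero_point=128)
print(qy.dequantize())
```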

Test Plan:
```
python test/test_quantized.py TestQuantizedOps.test_qelu
```

still need to benchmark, saving that for after the review comments

Imported from OSS

Differential Revision: D20370953

fbshipit-source-id: fe941bf966f72dd9eee2c4b2ef45fe7afb50c866
2020-03-12 09:31:00 -07:00
514cba0661 [JIT] remove builtin interpolate functions (#34514)
Summary:
`torch.nn.functional.interpolate` was written as a builtin op when we scripted the standard library, because it has four possible overloads. As a result, whenever we make a change to `interpolate`, we need to make changes in two places, and it also makes it impossible to optimize the interpolate op. The builtin is tech debt.

I talked with ailzhang, and the symbolic script changes are good to remove (I guess that makes a third place where we needed to re-implement interpolate).

I'm trying to get rid of unnecessary builtin operators because we're standardizing mobile bytecode soon, so we should try to get this landed as soon as possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34514

Differential Revision: D20391089

Pulled By: eellison

fbshipit-source-id: abc84cdecfac67332bcba6b308fca4db44303121
2020-03-12 09:21:33 -07:00
962e362427 Fix _cat operator (#34591)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34591

Test Plan: Imported from OSS

Differential Revision: D20388000

Pulled By: VitalyFedyunin

fbshipit-source-id: 8ae7593dbddc1a96a03193a99afc9a4ce46203ad
2020-03-12 09:20:10 -07:00
a22008f91e Prohibit copying autograd engines (#34567)
Summary:
Make sure that there cannot be more than one instance of either `torch::autograd::Engine` or `torch::autograd::python::PythonEngine`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34567

Test Plan: CI

Differential Revision: D20390622

Pulled By: malfet

fbshipit-source-id: c90595032afc88f552dee52901361b58b282dc1a
2020-03-12 08:06:53 -07:00
3c76b2aeea Replace THPLayout with at::Layout in Python Argument Parser (#34543) (#34584)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34584

Test Plan:
```
python setup.py develop
python test/test_torch.py
```
Output:
```
...
Ran 3834 tests in 198.825s

OK (skipped=180)
```

Imported from OSS

Differential Revision: D20403330

fbshipit-source-id: 41474d5e7001db070f98ac8379f909f0ac74deb6
2020-03-12 07:19:00 -07:00
f70945b1c3 fix the quantized batchnorm2d (#34579)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34579

Differential Revision: D20382783

Pulled By: lly-zero-one

fbshipit-source-id: dadfc4974cb4c808f1eedf8cc4ec52ec8d3ea1b0
2020-03-12 00:48:40 -07:00
c235be42dd [jit] kill script namespace (#34515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34515

Once upon a time we thought this was necessary. In reality it is not, so
removing it.

For backcompat, our public interface (defined in `api/`) still has
typedefs to the old `script::` names.

There was only one collision: `Pass` as a `Stmt` and `Pass` as a graph
transform. I renamed one of them.

Test Plan: Imported from OSS

Differential Revision: D20353503

Pulled By: suo

fbshipit-source-id: 48bb911ce75120a8c9e0c6fb65262ef775dfba93
2020-03-11 23:32:48 -07:00
cf8b728255 Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema. (#34588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34588

I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema.  Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.

Reland of https://github.com/pytorch/pytorch/pull/34160

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20387079

Pulled By: ezyang

fbshipit-source-id: d189f7a6ad8cd186b88b6fbfa3f189994eea14e8
2020-03-11 20:59:46 -07:00
b039bca4db Fix typo in data.rst (#34624)
Summary:
Fix minor typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34624

Differential Revision: D20401946

Pulled By: ngimel

fbshipit-source-id: 0c6a7d838aa15120b3ecb8b9ba4b57550c9bcd32
2020-03-11 19:40:18 -07:00
2fe7fc681d [PT] add macro to expose caffe2 ops to PyTorch mobile (#34578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34578

Right now, C10_EXPORT_CAFFE2_OP_TO_C10_CPU doesn't work on mobile since we disabled some code paths. This diff adds a new macro to enable these code paths so we can register caffe2 ops in PT mobile.

Test Plan:
verified caffe2 ops are registered in PT mobile
(on the whole stack)

```
_caffe2::BBoxConcatBatchSplits(Tensor[] input_list, Tensor[]? _caffe2_preallocated_outputs=None) -> (Tensor output)
_caffe2::BBoxTransform(Tensor rois, Tensor deltas, Tensor im_info, float[] weights, bool apply_scale, bool rotated, bool angle_bound_on, int angle_bound_lo, int angle_bound_hi, float clip_angle_thresh, bool legacy_plus_one, Tensor[]? _caffe2_preallocated_outputs=None) -> (Tensor output_0, Tensor output_1)
_caffe2::BoxWithNMSLimit(Tensor scores, Tensor boxes, Tensor batch_splits, float score_thresh, float nms, int detections_per_im, bool soft_nms_enabled, str soft_nms_method, float soft_nms_sigma, float soft_nms_min_score_thres, bool rotated, bool cls_agnostic_bbox_reg, bool input_boxes_include_bg_cls, bool output_classes_include_bg_cls, bool legacy_plus_one, Tensor[]? _caffe2_preallocated_outputs=None) -> (Tensor scores, Tensor boxes, Tensor classes, Tensor batch_splits, Tensor keeps, Tensor keeps_size)
_caffe2::GenerateProposals(Tensor scores, Tensor bbox_deltas, Tensor im_info, Tensor anchors, float spatial_scale, int pre_nms_topN, int post_nms_topN, float nms_thresh, float min_size, bool angle_bound_on, int angle_bound_lo, int angle_bound_hi, float clip_angle_thresh, bool legacy_plus_one, Tensor[]? _caffe2_preallocated_outputs=None) -> (Tensor output_0, Tensor output_1)
_caffe2::HeatmapMaxKeypoint(Tensor heatmaps, Tensor bboxes_in, bool should_output_softmax=True, Tensor[]? _caffe2_preallocated_outputs=None) -> (Tensor keypoints)
_caffe2::ResizeNearest(Tensor X, str order, float width_scale, float height_scale, Tensor[]? _caffe2_preallocated_outputs=None) -> (Tensor Y)
_caffe2::RoIAlign(Tensor features, Tensor rois, str order, float spatial_scale, int pooled_h, int pooled_w, int sampling_ratio, bool aligned, Tensor[]? _caffe2_preallocated_outputs=None) -> (Tensor)
```

Reviewed By: dreiss

Differential Revision: D20128254

fbshipit-source-id: 49a837dddc431eb528b5c72ffdfe0d0131cd10b4
2020-03-11 19:15:14 -07:00
0dc0fffca1 [net_transform] only skip ConstantFill for autogen_grad (#34628)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34628

Differential Revision: D20370564

fbshipit-source-id: 854c8ab44ba262e5020383447ed6bb629064ec33
2020-03-11 19:09:52 -07:00
86fb522acd Remove cudaMemcpy on full memory overlap (#34548)
Summary:
TensorIterator already checks for partial overlap, so there is no trivial UB, but TensorIterator allows full overlap, and it is not a bad idea to skip the memcpy in that case.
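
The full-overlap case in question is a self-copy; a minimal illustration, assuming that is the canonical trigger:

```python
import torch

x = torch.randn(8, device="cuda" if torch.cuda.is_available() else "cpu")
x.copy_(x)  # src and dst fully overlap; the copy is now skipped entirely
```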

fixes: https://github.com/pytorch/pytorch/issues/34525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34548

Differential Revision: D20371643

Pulled By: ngimel

fbshipit-source-id: ff9e2e872537010afe040204e008b2499af963ad
2020-03-11 17:36:03 -07:00
adb8e26182 Fix for handling batch size 0. (#34599)
Summary:
Separating this out into a different diff; however, since most of the
xnnpack integration was not tested until PR https://github.com/pytorch/pytorch/issues/34047, this was not
caught until then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34599

Test Plan: Tested in test/test_xnnpack_integration.py via https://github.com/pytorch/pytorch/issues/34047.

Differential Revision: D20391000

Pulled By: kimishpatel

fbshipit-source-id: 596a3e54445072ab63f700d425d07c7f44586683
2020-03-11 16:36:28 -07:00
9064fafb6e [C++ API] Update torch::nn layer docs (#34522)
Summary:
This PR updates C++ API torch::nn layer docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34522

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20380832

Pulled By: yf225

fbshipit-source-id: ee99a838ec05c6ce2a23aa97555707e507d09958
2020-03-11 16:09:09 -07:00
56832bf7f3 [JIT] Add support for tolist for GPU-resident Tensors (#34554)
Summary:
**Summary**
This commit modifies the JIT implementation of `Tensor.tolist` so that it
can be called on GPU-resident Tensors as well. If the Tensor is not on the
CPU when the operator is invoked, it is copied to the CPU before doing any
of the rest of the work to convert it into a list.
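
A minimal TorchScript sketch; the annotated return type supplies the dimension and element type the compiler needs:

```python
from typing import List

import torch

@torch.jit.script
def to_nested_list(x: torch.Tensor) -> List[List[int]]:
    # GPU tensors are copied to the CPU internally before conversion.
    return x.tolist()

if torch.cuda.is_available():
    print(to_nested_list(torch.arange(6, device="cuda").view(2, 3)))
```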

**Testing**
This commit adds GPU versions of some of the existing CPU tests for this
feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34554

Differential Revision: D20392604

Pulled By: SplitInfinity

fbshipit-source-id: 69c17b98d866428c19d683588046169538aaf1e3
2020-03-11 15:14:12 -07:00
866505b100 [ci] try to fix rocm builds (#34600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34600

They are failing with:
```
E: The method driver /usr/lib/apt/methods/https could not be found.
```

Trying the solution recommended in: https://unix.stackexchange.com/questions/263801/apt-get-fails-the-method-driver-usr-lib-apt-methods-https-could-not-be-found

The long-term solution is to move all this to be pre-installed in the
docker image.

Test Plan: Imported from OSS

Differential Revision: D20391153

Pulled By: suo

fbshipit-source-id: 959dff2ea9e77bb52739c0659e9d800cdbe4cb01
2020-03-11 15:01:12 -07:00
2de4f245c6 Fix typo in documentation (#34581)
Summary:
Update the parameter description of `total_steps` in `OneCycleLR`. References https://github.com/pytorch/pytorch/issues/34531
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34581

Differential Revision: D20386306

Pulled By: albanD

fbshipit-source-id: f8b424a01760e8f5d4de5367b6c60fb342019689
2020-03-11 13:57:10 -07:00
25e4e9eb86 [On-device Benchmark] speed_benchmark_torch switch to log latency from dataset level to row level (#34598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34598

as above

Test Plan:
test.txt
```
what time is it now
could you set a reminder at 7 am
waht is the weather today
```
example json
```
{
    "model": {
      "category": "CNN",
      "description": "Assistant Mobile Inference",
      "files": {
        "model": {
          "filename": "model.pt1",
          "location": "//everstore/GICWmAB2Znbi_mAAAB0P51IPW8UrbllgAAAP/model.pt1",
          "md5": "c0f4b29c442bbaeb0007fb0ce513ccb3"
        },
        "data": {
          "filename": "input.txt",
          "location": "/home/pengxia/test/input.txt",
          "md5": "c0f4b29c442bbaeb0007fb0ce513ccb3"
        }
      },
      "format": "pytorch",
      "framework": "pytorch",
      "kind": "deployment",
      "name": "Assistant Mobile Inference"
    },
    "tests": [
      {
        "command": "{program} --model {files.model}  --input_dims \"1\" --input_type NLUType --warmup {warmup} --iter 5 --input_file {files.data} --report_pep true",
        "identifier": "{ID}",
        "metric": "delay",
        "iter": 15,
        "warmup": 2,
        "log_output": true
      }
    ]
  }

```

iter = 5 (`--iter 5`) * 3 (3 lines in test.txt) = 15

arbabu123 I will provide a wrapper to compute the iter in the future.

run following command
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/assistant_mobile_inference.json --platform android/full_jit --framework pytorch --remote --devices  SM-G960U-8.0.0-26
```

results
https://our.intern.facebook.com/intern/aibench/details/275259559594003

**Note: this is compatible with the existing examples.**

Reviewed By: kimishpatel, ljk53

Differential Revision: D20389285

fbshipit-source-id: 80165ef394439a307ac7986cf540a80fdf3d85d6
2020-03-11 13:51:42 -07:00
70f3298684 Fix SELECTED_OP_LIST file path issue (#33942)
Summary:
If SELECTED_OP_LIST is specified as a relative path on the command line, the CMake build will fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33942

Differential Revision: D20392797

Pulled By: ljk53

fbshipit-source-id: dffeebc48050970e286cf263bdde8b26d8fe4bce
2020-03-11 13:19:31 -07:00
1f834b5c2a [JIT] Torchbind error if Python instantiates a class that doesn't exist (#34568)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34568

Test Plan: Imported from OSS

Differential Revision: D20378106

Pulled By: jamesr66a

fbshipit-source-id: 395a3b05d23727b9cfd074440b2d0e8ef002ec09
2020-03-11 13:13:08 -07:00
12fb8148e4 Disable ROCM when building mobile libtorch. (#34478)
Summary:
When a system has the ROCm dev tools installed, `scripts/build_mobile.sh` tried to use them.
This PR stops looking up the unused ROCm libraries when building mobile libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34478

Differential Revision: D20388147

Pulled By: ljk53

fbshipit-source-id: b512c38fa2d3cda9ac20fe47bcd67ad87c848857
2020-03-11 11:28:32 -07:00
b553e6911a [distributed] quicker exit in the case of failed tests in distributed (#34150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34150

In the distributed setting we commonly have tests in which there are errors where one process
exits but the others do not (since they are, for example, waiting for work from
the process that exited). Currently, when this situation happens we do not
handle it well, and wait for process 0 to time out. This results in wasted
time waiting for test errors and a less helpful "Process 0 timed out..." error
message when the error was actually something else.

This diff fixes the issue by checking for exited subprocesses and terminating
the test when we see a subprocess that has exited uncleanly. We still enforce
timeouts and return when all processes have exited cleanly in the happy path.
ghstack-source-id: 99921462

Test Plan:
All distributed tests + tested by writing tests that should trigger
the unclean subprocess detection, and verified that we exit quickly instead of
waiting for the entire timeout.

Differential Revision: D20231032

fbshipit-source-id: 3e0d4a20925b7d1098ec4c40ffcc66845425dd62
2020-03-11 11:27:17 -07:00
2cf576e9ea small typos (#34589)
Summary:
Spotted a couple of small typos 🙏
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34589

Differential Revision: D20387653

Pulled By: ngimel

fbshipit-source-id: 3089fe606ccb8c8ee57cf7a900aba714fd0ce567
2020-03-11 11:01:31 -07:00
82cdd3abae Stop last usage of newWithSize. (#34386)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34386

Test Plan: Imported from OSS

Differential Revision: D20311061

Pulled By: gchanan

fbshipit-source-id: 1e90a90db2efa1a566d4a78a6d1b8d918b91cf66
2020-03-11 09:58:30 -07:00
4b929e5466 Revert D20193196: [pytorch][PR] PCA and SVD for low-rank matrices, LOBPCG for positive-definite generalized eigenvalue problem
Test Plan: revert-hammer

Differential Revision:
D20193196

Original commit changeset: 78a487991242

fbshipit-source-id: 8da4f8cb17c45af41e8c0ce80bc72581eb10dbb8
2020-03-11 09:24:34 -07:00
6f8a8e4e47 Revert D20282846: Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema.
Test Plan: revert-hammer

Differential Revision:
D20282846

Original commit changeset: ba7bca6e8adc

fbshipit-source-id: b9e15d2b2c3d1dbc6e971ab3c0bdf380e769dcf1
2020-03-11 07:50:29 -07:00
63964175b5 Revert D20379910: [pytorch][PR] Set USE_RCCL cmake option (dependent on USE_NCCL)
Test Plan: revert-hammer

Differential Revision:
D20379910

Original commit changeset: 981f924be93d

fbshipit-source-id: 2cfc2eebe6ebabf801f0ea6a183aad2342ada79f
2020-03-11 07:41:13 -07:00
2ec779d46c PCA and SVD for low-rank matrices, LOBPCG for positive-definite generalized eigenvalue problem (#29488)
Summary:
This PR implements the following linear algebra algorithms for low-rank matrices (a short usage sketch follows the list):
- [x] Approximate `A` as `Q Q^H A` - using Algorithm 4.4 from [Halko et al, 2009](http://arxiv.org/abs/0909.4061).
  + exposed as `torch.lowrank.get_approximate_basis(A, q, niter=2, M=None) -> Q`
  + [x] dense matrices
  + [x] batches of dense matrices
  + [x] sparse matrices
  + [x] documentation
- [x] SVD - using Algorithm 5.1 from [Halko et al, 2009](http://arxiv.org/abs/0909.4061).
  + uses `torch.lowrank.get_approximate_basis`
  + exposed as `torch.svd_lowrank(A, q=6, niter=2, M=None) -> (U, S, V)`
  + [x] dense matrices
  + [x] batches of dense matrices
  + [x] sparse matrices
  + [x] documentation
- [x] PCA - using `torch.svd_lowrank`
  + uses `torch.svd_lowrank`
  + exposed as `torch.pca_lowrank(A, center=True, q=None, niter=2) -> (U, S, V)`
  + [x] dense matrices
  + [x] batches of dense matrices
  + [x] sparse matrices, uses non-centered sparse matrix algorithm
  + [x] documentation
- [x] generalized eigenvalue solver using the original LOBPCG algorithm [Knyazev, 2001](https://epubs.siam.org/doi/abs/10.1137/S1064827500366124)
  + exposed as `torch.lobpcg(A, B=None, k=1, method="basic", ...)`
  + [x] dense matrices
  + [x] batches of dense matrices
  + [x] sparse matrices
  + [x] documentation
- [x] generalized eigenvalue solver using robust LOBPCG with orthogonal basis selection [Stathopoulos, 2002](https://epubs.siam.org/doi/10.1137/S1064827500370883)
  + exposed as `torch.lobpcg(A, B=None, k=1, method="ortho", ...)`
  + [x] dense matrices
  + [x] batches of dense matrices
  + [x] sparse matrices
  + [x] documentation
- [x] generalized eigenvalue solver using the robust and efficient LOBPCG Algorithm 8 from [Duersch et al, 2018](https://epubs.siam.org/doi/abs/10.1137/17M1129830) that switches to orthogonal basis selection automatically
  + the "ortho" method improves iterations so rapidly that in the current test cases it does not make sense to use the basic iterations at all. If users will have matrices for which basic iterations could improve convergence then the `tracker` argument allows breaking the iteration process at user choice so that the user can switch to the orthogonal basis selection if needed. In conclusion, there is no need to implement Algorithm 8 at this point.
- [x] benchmarks
  + [x] `torch.svd` vs `torch.svd_lowrank`, see notebook [Low-rank SVD](https://github.com/Quansight/pearu-sandbox/blob/master/pytorch/Low-rank%20SVD.ipynb). In conclusion, the low-rank SVD is going to be useful only for large sparse matrices where the full-rank SVD will fail due to memory limitations.
  + [x] `torch.lobpcg` vs `scipy.sparse.linalg.lobpcg`, see notebook [LOBPCG - pytorch vs scipy](https://github.com/Quansight/pearu-sandbox/blob/master/pytorch/LOBPCG%20-%20pytorch%20vs%20scipy.ipynb). In conclusion, both implementations give the same results (up to numerical errors from different methods); the scipy lobpcg implementation is generally faster.
  + [x] On very small tolerance cases, `torch.lobpcg` is more robust than `scipy.sparse.linalg.lobpcg` (see `test_lobpcg_scipy` results)
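
A short usage sketch of the APIs listed above (shapes and `q`/`k` values are illustrative):

```
import torch

A = torch.randn(100, 20)
U, S, V = torch.svd_lowrank(A, q=6, niter=2)         # approximate rank-6 SVD
U2, S2, V2 = torch.pca_lowrank(A, q=6, center=True)  # PCA via low-rank SVD

# torch.lobpcg expects a symmetric positive-definite input:
M = A.t() @ A + 20 * torch.eye(20)
eigenvalues, eigenvectors = torch.lobpcg(M, k=2, method="ortho")
```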

Resolves https://github.com/pytorch/pytorch/issues/8049.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29488

Differential Revision: D20193196

Pulled By: vincentqb

fbshipit-source-id: 78a4879912424595e6ea95a95e483a37487a907e
2020-03-11 07:33:49 -07:00
5fc5cf6571 Stop using ctypes to interface with CUDA libraries. (#33678)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33016, Continuation of https://github.com/pytorch/pytorch/issues/31160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33678

Differential Revision: D20249187

Pulled By: ezyang

fbshipit-source-id: 172ce4a0fee7fbe01436a421d1af22ef6173b6ed
2020-03-11 07:22:46 -07:00
9d42177a31 Delete OperatorOptions, absorb AliasAnalysisKind into FunctionSchema. (#34160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34160

I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema.  Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20282846

Pulled By: ezyang

fbshipit-source-id: ba7bca6e8adc3365789639b88e54c4e881b1692e
2020-03-11 07:15:18 -07:00
b2344b70da Beef up documentation on Dispatcher.h, reorder methods for clarity. (#33838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33838

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20227875

Pulled By: ezyang

fbshipit-source-id: 319855b1f0fa436f9ed5256d2106b07f20e6b833
2020-03-11 07:13:39 -07:00
fbbeee0983 Port remainder from TH to ATen (CPU and CUDA) (#34136)
Summary:
CPU issue https://github.com/pytorch/pytorch/issues/24753
CUDA issue https://github.com/pytorch/pytorch/issues/24615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34136

Differential Revision: D20375458

Pulled By: ezyang

fbshipit-source-id: 1a9fb39a7e2d17a0d31bd14b211eaacea060e834
2020-03-11 07:08:11 -07:00
7aca9afdfb [pytorch] remove boilerplate setQEngine() from PyTorch mobile predictors (#34556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34556

According to
https://github.com/pytorch/pytorch/pull/34012#discussion_r388581548,
this `at::globalContext().setQEngine(at::QEngine::QNNPACK);` call isn't
really necessary for mobile.

In Context.cpp it selects the last available QEngine if the engine isn't
set explicitly. The OSS mobile prebuild should only include the QNNPACK
engine, so the default behavior should already be the desired behavior.

It makes a difference only when USE_FBGEMM is set - but that should be off
for both the OSS mobile build and the internal mobile build.
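
From Python, the selected engine can be inspected (a small sketch mirroring the C++ `Context` default-selection behavior described above; the printed values depend on the build):

```
import torch

print(torch.backends.quantized.supported_engines)  # engines compiled into this build
print(torch.backends.quantized.engine)             # the engine that will be used
```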

Test Plan: Imported from OSS

Differential Revision: D20374522

Pulled By: ljk53

fbshipit-source-id: d4e437a03c6d4f939edccb5c84f02609633a0698
2020-03-11 00:55:14 -07:00
2ce9513b0c AccumulateGrad: ensure sparse tensor indices and values refcount is always 1 (#34559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34559

We check the use_count for indices and values when we avoid a clone
for sparse tensors. The sparse tensor grad itself might have a higher refcount
due to DDP hooks/dist autograd structures holding refs, but the indices and
values inside the sparse tensor should always have a refcount of 1.
ghstack-source-id: 99900534

Test Plan: waitforbuildbot

Differential Revision: D20375239

fbshipit-source-id: 6a654549d13071ab3451cef94259caf7627b575c
2020-03-10 23:41:44 -07:00
ab2297dfe6 Add Tensor overload for start in narrow. (#34317)
Summary:
https://github.com/pytorch/pytorch/issues/31558
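
A minimal sketch of the new overload (usage inferred from the PR title; the 0-dim Tensor `start` is the addition):

```
import torch

x = torch.arange(10)
start = torch.tensor(2)        # 0-dim Tensor start
print(x.narrow(0, start, 3))   # tensor([2, 3, 4])
```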
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34317

Differential Revision: D20294333

Pulled By: ailzhang

fbshipit-source-id: 47c6646ae298e04a455923bd5048db026a5e3c7c
2020-03-10 22:33:22 -07:00
2e88a78d2e add quantized_hardtanh (#34097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34097

Adds quantized hardtanh.  Calls the clamp kernel behind the
scenes.
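
A small usage sketch (assuming the op is exposed via `torch.nn.quantized.functional.hardtanh`; the exact Python binding may differ):

```
import torch

x = torch.randn(4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
out = torch.nn.quantized.functional.hardtanh(qx, min_val=-1.0, max_val=1.0)
```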

Test Plan:
```
python test/test_quantized.py
```

Imported from OSS

Differential Revision: D20208860

fbshipit-source-id: 165a6a1c22f1dcc479679e5ea0c990d0e9c3b6c5
2020-03-10 22:27:15 -07:00
8d84c5f1c7 Fix static data initialization deadlock on GIL (#34505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34505

A thread could hold the GIL when calling PythonRpcHandler::getInstance(),
while another thread could be doing static data
initialization by calling `new PythonRpcHandler()`, inside of which the GIL is
also required. Static data initialization is thread-safe, so the thread
holding the GIL will wait for the other thread to finish the static data
initialization before going forward. Because the initialization can't
proceed without the GIL, there is a deadlock. We ask the calling thread to
release the GIL to avoid this situation.
ghstack-source-id: 99893858

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_spawn -- 'test_backward_simple_script_call \(test_dist_autograd_spawn\.DistAutogradTestWithSpawn\)' --stress-runs 100
```

Differential Revision: D7490489

fbshipit-source-id: 76f63cc7bedf088d3dbff288f53aa0bd33749255
2020-03-10 20:40:22 -07:00
ce77d4a316 Set USE_RCCL cmake option (dependent on USE_NCCL) (#31341)
Summary:
so that Gloo build has RCCL path enabled for ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31341

Differential Revision: D20379910

Pulled By: ezyang

fbshipit-source-id: 981f924be93ddcc0705c1934f92d938c29aaf312
2020-03-10 20:26:09 -07:00
23b2fba79a [jit] Add type tags to lists/dicts in pickle (#33255)
Summary:
Stacked PRs
 * #33474 - [jit] Remove list specializations from pickler
 * **#33255 - [jit] Add type tags to lists/dicts in pickle**

This adds a global call to `torch.jit._pickle.restore_type_tags` for
lists and dicts so that we can preserve their types after serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33255

Pulled By: driazati

Differential Revision: D20346780

fbshipit-source-id: c8534954ef4adb2e3c880401acbee30cd284f3db
2020-03-10 19:17:01 -07:00
4167db11f7 [pytorch][ci] add build_only flag to mobile CI jobs (#34560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34560

These jobs don't have a next phase, so we don't really need to commit the
docker images.
Should also fix issue #34557.

Test Plan: Imported from OSS

Differential Revision: D20375308

Pulled By: ljk53

fbshipit-source-id: 328cb428fcfb0fbb79b2a233b5f52607158c983c
2020-03-10 17:45:51 -07:00
a09c4d3997 [pt][quant] Vectorized qmul and more methods on qint data types (#34376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34376

Vectorized implementation of qmul. qmul is now ~16x faster on my development machine. This implementation works for qint8, quint8 and qint32. Also added some commonly used operations, such as the multiply operator and the requantize operation, to the qint vector classes for future use.

```
#!/usr/bin/env python

import time
import torch
import torch.nn as nn
torch.set_num_threads(1)
# print(torch.__config__.parallel_info())

A = torch.rand(1, 54, 54, 256)
B = torch.rand(1, 54, 54, 256)

scale = .05
zero_point = 50

for dtype in [torch.quint8, torch.qint8]:

    qA = torch.quantize_per_tensor(A, scale=scale, zero_point=zero_point,
            dtype=dtype)
    qB = torch.quantize_per_tensor(B, scale=scale, zero_point=zero_point,
            dtype=dtype)

    NITER = 1000
    s = time.time()
    for i in range(NITER):
        out = torch.ops.quantized.mul(qA, qB, scale=scale, zero_point=zero_point)
    time_per_iter = (time.time() - s) / NITER

    print('dtype: {} time per iter ms: {:.3f}'.format(dtype, time_per_iter * 1000))
```
### Before
dtype: torch.quint8 time per iter ms: 6.714
dtype: torch.qint8 time per iter ms: 6.780

### After
dtype: torch.quint8 time per iter ms: 0.431
dtype: torch.qint8 time per iter ms: 0.417

### Test
Modified qmul tests to include qint8 and qint32 data types.

python test/test_quantized.py TestQuantizedOps.test_qmul_relu_same_qparams
python test/test_quantized.py TestQuantizedOps.test_qmul_relu_different_qparams
python test/test_quantized.py TestQuantizedOps.test_qmul_broadcast
ghstack-source-id: 99862681

Differential Revision: D20308515

fbshipit-source-id: 4fa65b2ba433cfd59260fc183a70f53a6fcc36b4
2020-03-10 16:51:41 -07:00
903ad90325 [JIT] Introduce a fake Tensor creation node for IR unit tests (#34334)
Summary:
**Summary**
There is often a need to create a Tensor when writing IR by hand for JIT
optimisation pass unit tests. The only options for this today are real
Tensor creation functions like `aten::ones`. Any test that uses these functions
must also use the same default arguments as the Python/C++ API, which means
that all of the tests have to be updated when the API is updated. This commit
introduces a new primitive, `prim::MakeTestTensor` with schema `() -> Tensor` that
should be used in unit tests instead of real Tensor creation functions. This new
primitive has no public-facing API, so the maintenance burden is much lower.

**Testing**
This commit updates the alias analysis and DCE tests to use `prim::MakeTestTensor` instead of
`aten::rand`, `aten::ones`, and `aten::zeros`.

```
$ ./bin/test_jit
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = *-*_CUDA:*_MultiCUDA
[==========] Running 75 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 75 tests from JitTest
[ RUN      ] JitTest.ADFormulas
[       OK ] JitTest.ADFormulas (82 ms)
[ RUN      ] JitTest.Attributes
[       OK ] JitTest.Attributes (0 ms)
...
...
...
[ RUN      ] JitTest.LiteInterpreterPrim
[       OK ] JitTest.LiteInterpreterPrim (0 ms)
[ RUN      ] JitTest.LiteInterpreterLoadOrigJit
[       OK ] JitTest.LiteInterpreterLoadOrigJit (2 ms)
[----------] 75 tests from JitTest (150 ms total)

[----------] Global test environment tear-down
[==========] 75 tests from 1 test case ran. (150 ms total)
[  PASSED  ] 75 tests.
```

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33500.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34334

Differential Revision: D20296437

Pulled By: SplitInfinity

fbshipit-source-id: df4e7b0881ae4913424e5a409bfa171a61c3e568
2020-03-10 16:12:45 -07:00
d0834c5b64 Preserve memory format for torch.cat on CUDA (#34526)
Summary:
fix https://github.com/pytorch/pytorch/issues/34084

cc: ptrblck VitalyFedyunin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34526

Differential Revision: D20371847

Pulled By: ngimel

fbshipit-source-id: e3b1a34caff2db8099ad9afe91bf9b473d5da6e8
2020-03-10 16:06:10 -07:00
be3bc1deb1 convert counter back to list #33229 (#33356)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33229
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33356

Differential Revision: D20003196

Pulled By: vincentqb

fbshipit-source-id: 96f9e0fc7e99a7c2e202f932d1a2ffa158afad92
2020-03-10 15:46:24 -07:00
dd7cec680c Do not use clang if it can not parse system extensions (#34549)
Summary:
Attempting to build pytorch with ASAN on a system with gcc-8 fails due to mismatched system compilation flags.
Address the issue by using the original compiler to build the `torch._C` extension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34549

Test Plan: Run `.jenkins/pytorch/build-asan.sh` on FC-30

Differential Revision: D20373781

Pulled By: malfet

fbshipit-source-id: 041c8d25f96b4436385a5e0eb6fc46e9b5fdf3f1
2020-03-10 15:40:08 -07:00
09296c34a4 Add the build for runtime dispatch for AVX, AVX2 instruction set (#26125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26125

We already have some optimized implementations that use AVX2 to improve quantized kernel performance. In this diff, we enable runtime dispatch for them.

Test Plan:
Sandcastle build and test

Also test with a python binary calling into vectorized op.

torch.__config__.show()
PyTorch built with:
  - GCC 4.2
  - clang 8.0.20181009
  - Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.18.1 (Git Hash N/A)
  - OpenMP 1
  - **CPU capability usage: AVX2**
  - Build settings:

Reviewed By: jamesr66a

Differential Revision: D17337251

fbshipit-source-id: 8e22d10011a12a4eaf54cea3485353eb1811d828
2020-03-10 15:32:57 -07:00
259d7299db [caffe2] do not declare __assert_fail in clang builds (#33893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33893

It appears that when Clang drives CUDA compilation, `__assert_fail` is always defined as a device function.

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true -c cxx.untracked_headers=ignore //fblearner/flow/projects/dper:workflow
```

Reviewed By: ngimel

Differential Revision: D20145034

fbshipit-source-id: 23153411ed631e05421c7afcf41b7ea5619cdd96
2020-03-10 14:45:03 -07:00
2d24005d18 [C++ API Parity] rmsprop optimizer update (#33450)
Summary:
**This PR is BC-breaking in the following way:**

In RMSpropOptions:
1. learning_rate is renamed to lr.

**Test plan before 1.5 release:**

Test that in 1.5 we can load a C++ RMSprop optimizer that was serialized in 1.4, and their states are the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33450

Differential Revision: D20366623

Pulled By: anjali411

fbshipit-source-id: 83250be9b583a766927e0e22a4de8b0765379451
2020-03-10 13:30:56 -07:00
6f12145c60 Change std::to_string call to c10::to_string
Summary: I'm using this code in an internal Android build, and std::to_string doesn't work in our internal Android builds yet.

Test Plan: Internal build.

Reviewed By: ljk53

Differential Revision: D20234221

fbshipit-source-id: 8fd61235bf9b487e07a1459c452830e732c7afb0
2020-03-10 13:18:27 -07:00
2cf344be4c Turn on exact_dtype by default on test_sparse.py (#34489) (#34542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34542

Turn on exact_dtype by default on test_sparse.py (#34489)

Pull Request resolved: #34489

Test Plan:
```
python test/test_sparse.py
```

Imported from OSS

Differential Revision: D20369764

fbshipit-source-id: ade2434f77af8ae419bda653b4c46616c052a8b2
2020-03-10 12:52:09 -07:00
b185359fb4 Avoid clone for sparse tensors during accumulation of grads. (#33427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33427

This PR is an attempt to avoid clone for sparse tensors similar to how
we avoid clone for dense tensors currently.

As per my understanding, even if the 'indices' and 'values' of a sparse tensor
are non-contiguous, operations like 'add' are still supported. As a result,
the major change in this PR is to create a shallow copy instead of clone()
for sparse tensors.
ghstack-source-id: 99838375

Test Plan: waitforbuildbot

Differential Revision: D19926698

fbshipit-source-id: b5a3f36c2aa273e17f8b7a9f09c1ea00e7478109
2020-03-10 12:41:47 -07:00
5f61f42c79 .circleci: Switch should_run_job cuda 10.1 -> 10.2 (#34498)
Summary:
We updated the default jobs to run in a different PR but neglected to
update this script as well.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34498

Differential Revision: D20368420

Pulled By: seemethere

fbshipit-source-id: 240171b18f397095e3a8d57de3a29d1d2e891d85
2020-03-10 12:25:09 -07:00
cd9d9a2235 fix handling of replica parameters in DataParallel (#33907)
Summary:
In DataParallel, replica parameters are not leaves (because they are computed via broadcast from master parameters), and should be treated as such. Fixes https://github.com/pytorch/pytorch/issues/33552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33907

Differential Revision: D20150199

Pulled By: ngimel

fbshipit-source-id: 5965d4115b6b3a8433063126ff6269567872fbeb
2020-03-10 10:35:44 -07:00
0dbfb26e53 Clean up include list of Shape.cu (#34528)
Summary:
The include list seems to be copied from somewhere else, and some totally unrelated files are included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34528

Differential Revision: D20358622

Pulled By: ngimel

fbshipit-source-id: d8a6260f5f77b0eabdbd68e3728873efd632d9bc
2020-03-10 10:29:20 -07:00
cb689a5d68 remove duplicated process group gloo timeout (#31342)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31342

Test Plan: unit test

Differential Revision: D19131704

fbshipit-source-id: 4e91d5933635ee2c7c301caf89a5a7009c5cb7c8
2020-03-10 09:08:02 -07:00
c7dd5f89a2 Fix #33562 (uncaught domain_error on macOS) (#34301)
Summary:
Tries to fix https://github.com/pytorch/pytorch/issues/33562 by raising `std::runtime_error` instead of `std::domain_error`.
* The Python tests already expect `RuntimeError` so this shouldn't affect Python users of PyTorch.
* If someone out there is using C10 or ATen from C++ and tries to catch `std::domain_error` specifically, this fix would break their code. Hopefully that's not the case.

Alternative to this PR is someone try to really get to the bottom of why `std::domain_error` isn't being caught.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34301

Differential Revision: D20344579

Pulled By: ezyang

fbshipit-source-id: d5f3045085a2f75b71b864335ebf44991d0cad80
2020-03-10 08:56:38 -07:00
9e94e46453 Check if rnn weights need to be flattened (#34265)
Summary:
cuDNN needs it, MIOpen doesn't. However, since the PyTorch preference seems to be not to introduce ROCm-specific logic in the Python layer, we need to add a C++ function to detect whether RNN weight flattening is needed.

This PR will be needed to fix the RNN unit test errors arising from PR https://github.com/pytorch/pytorch/issues/33837.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34265

Differential Revision: D20345105

Pulled By: ezyang

fbshipit-source-id: a2588a6e2ac6f7d1edf2b7872bc6a879a7df96ec
2020-03-10 08:45:29 -07:00
29b673392f [ROCm] Enable BFloat16 type for loss functions and few misc ops required for resnet50 (#34469)
Summary:
This PR enables the bfloat16 type for loss criterion ops (and the ops they depend on) and a few miscellaneous ops required to train resnet50.

iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34469

Differential Revision: D20348856

Pulled By: ezyang

fbshipit-source-id: 0a8f06c2169cfa3c9cf319120e27150170095f6c
2020-03-10 08:39:07 -07:00
20b18a58f1 Update compiler warning about ABI compatibility (#34472)
Summary:
3ac42677633a39c588c3fea19d2d4121f114edb3 already forces pytorch to use gcc>=5 everywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34472

Differential Revision: D20345134

Pulled By: ezyang

fbshipit-source-id: 3ce706405e8784cac5c314500466b5f988ad31bf
2020-03-10 08:12:07 -07:00
f5ee46f1cf Remove custom function in no_grad block error message (#33896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33896

Fixes #32625. Previously, we'd receive an error message if we have a
custom function return a view of an input in a no_grad block:
```
import torch
from torch.autograd import Function

class Alias(Function):
    @staticmethod
    def forward(ctx, x):
        return x[:]

    @staticmethod
    def backward(ctx, gx):
        return gx

inp = torch.rand(2, requires_grad=True)

with torch.no_grad():
    # Used to error out
    output = Alias.apply(inp)
```

After this change, the error no longer happens. The behavior becomes
consistent with what would happen if we had implemented an operator that does the
same thing as the custom function:
- the output requires_grad
- we are able to detect (and error out) if the user tries to modify the
output in-place outside of the no_grad block.
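
Continuing the example above, the new behavior can be sketched as (the exact error message is not quoted here):

```
with torch.no_grad():
    output = Alias.apply(inp)   # no longer raises

# `output` requires grad, and an in-place write outside the no_grad
# block is detected and raises a RuntimeError:
# output.add_(1)
```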

Test Plan: - new test

Differential Revision: D20345601

Pulled By: zou3519

fbshipit-source-id: 7f95b4254f52ddbf989d26f449660403bcde1c78
2020-03-10 07:58:55 -07:00
3e6e2e9b7b Print the current Node name in anomaly mode (#33875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33875

Fixes #33675.

I added a `current_node_name` argument to AnomalyMetadata::print_stack.
This is a mandatory arg because I found only one callsite and making it
a default arg on a virtual function can be confusing.

Test Plan:
- Tested locally:
https://gist.github.com/zou3519/09937387c83efc76e1700374d5c9c9d9
- I don't know how to add a test for this: the message is printed to
stderr but it isn't an exception nor a warning. I considered capturing
the stderr of a subprocess but that seems like asking for flakiness.

Differential Revision: D20349399

Pulled By: zou3519

fbshipit-source-id: 7585ddffe2bf9e1081f4028a9c44de783978a052
2020-03-10 07:51:52 -07:00
d30fa4837e Unify gradient accumulation between distributed autograd and local autograd (#33214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33214

Distributed autograd had some custom logic in terms of how we
accumulated gradients. This was mostly done early on to enable basic
functionality. Although, in the long term we should merge this logic with what
we have in the local autograd engine. A lot of work has gone into ensuring we
accumulate grads correctly and efficiently and we should reuse that as a
starting point.

We can investigate if we need further custom logic for distributed autograd
later on if we need additional optimizations.

In this PR I've merged the gradient accumulation logic and also the gradient
hooks. As a result, now gradient hooks are called in distributed autograd as
well.
ghstack-source-id: 99838019

Test Plan: waitforbuildbot

Differential Revision: D19843284

fbshipit-source-id: 7923d7e871fb6afd3e98dba7de96606264dcb5f3
2020-03-10 01:56:08 -07:00
4f62cbe7de [ONNX] Support one_hot (#34454)
Summary:
This PR resolves https://github.com/pytorch/pytorch/issues/22534 by adding a converter for the `torch.nn.functional.one_hot` function, and covering it with a test.

Are there other places this should be tested?
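
A minimal export sketch for the newly supported op (the module and output path are illustrative):

```
import torch
import torch.nn.functional as F

class OneHot(torch.nn.Module):
    def forward(self, x):
        return F.one_hot(x, num_classes=4)

torch.onnx.export(OneHot(), torch.tensor([0, 2, 1]), "one_hot.onnx")
```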
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34454

Reviewed By: hl475

Differential Revision: D20354255

Pulled By: houseroad

fbshipit-source-id: 84224c1610b2cc7986c91441c65647ddc090750d
2020-03-09 22:26:36 -07:00
965146b818 [jit] delete netdef converter (#33807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33807

afaik this is unused, so removing it from the source tree. RIP :(

Test Plan: Imported from OSS

Differential Revision: D20122118

Pulled By: suo

fbshipit-source-id: cb45943f5b9f969482301a2f9fe540326dbc78f2
2020-03-09 22:25:16 -07:00
3671036ef3 Adds true_divide function, analogous to Python's, JAX's, NumPy's (true) division (#34236)
Summary:
See NumPy's division documentation here: https://numpy.org/doc/1.18/reference/generated/numpy.divide.html#numpy.divide.

True division is the same as PyTorch's default division except when both inputs are integer or bool tensors. In the latter case the inputs are (conceptually) cast to the default floating type before the division is performed.

The function is implemented for dense and sparse tensors and supports exporting to ONNX from PyTorch's eager mode or JIT traces. The function is inherently incompatible with exporting to ONNX via JIT script, and is another datapoint suggesting we should deprecate exporting scripted graphs to ONNX.

Tests are added for the type promotion, named tensor, and ONNX export behavior.
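
For example (a minimal sketch):

```
import torch

a = torch.tensor([1, 2, 3])   # integer tensor
b = torch.tensor([2, 2, 2])

# Integer inputs are (conceptually) cast to the default floating type:
print(torch.true_divide(a, b))   # tensor([0.5000, 1.0000, 1.5000])
```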
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34236

Reviewed By: houseroad

Differential Revision: D20334087

Pulled By: mruberry

fbshipit-source-id: 83d00d886f46f713215d7d9e02ffd043164c57f1
2020-03-09 21:06:33 -07:00
e408d46477 Print pytorch version before running ASAN tests (#34521)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34521

Test Plan: CI

Differential Revision: D20357233

Pulled By: malfet

fbshipit-source-id: 1c1b5a94a66d828383676a7a1403bbc13bb21c83
2020-03-09 20:52:46 -07:00
b9c32209db Use SerializedPyObj in PythonRpcHandler::generatePythonUDFResult (#34495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34495

Differential Revision: D20347466

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: 79625adb4ac3c9c6da4f40016e973bf17466c693
2020-03-09 20:41:05 -07:00
b82658810e Split deserialize from _run_function in RPC internal.py (#34494)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34494

Differential Revision: D20347463

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: e6fd886622f26c46bb83ac118e67abb2f5b296b9
2020-03-09 20:41:00 -07:00
544fb64440 Use SerializedPyObj in PythonRpcHandler (#34493)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34493

Differential Revision: D20347462

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: 9edda9eb95b1994464459271bb53ee77b760e474
2020-03-09 20:40:55 -07:00
18ef09f5ac Remove _load_return_value from RPC internal.py (#34492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34492

Differential Revision: D20347468

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: 92388d0d50a08fb895bacacf94c7b5495b4ae2b6
2020-03-09 20:40:50 -07:00
6d1c4df660 Consolidate Python Messages to use SerializedPyObj (#34491)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34491

Differential Revision: D20347467

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: efae4111d961f3a528cede77c863fb049cda9029
2020-03-09 20:40:45 -07:00
3b661eb84c Avoid copy contents in SerializedPyObj (#34490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34490

Differential Revision: D20347465

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: d59e74e3ee9122992a5c50a083e43ab31b7a70f5
2020-03-09 20:38:54 -07:00
2de4fa702b [JIT] Preserve qualified names on traced modules (#34395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34395

fixes: https://github.com/pytorch/pytorch/issues/33913

Test Plan: Imported from OSS

Differential Revision: D20347778

Pulled By: jamesr66a

fbshipit-source-id: 7b5a35b6f9678c34cb6127d531fa3bfe65703116
2020-03-09 19:23:53 -07:00
79e1305519 [net_runner] Get shape info from qtensors (#34321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34321

Mostly cosmetic as we can infer the shape anyway. It can remove a lot of the noise in the log though.

Note that weight sharing doesn't work yet. I'll add another diff to address this.

Reviewed By: houseroad

Differential Revision: D20290841

fbshipit-source-id: fe6f9b60d05dbe150af15b5d9d7a69fd902e12cc
2020-03-09 18:34:16 -07:00
e16908cb1f profile block outputs; helps guard elimination (#33889)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33889

Reviewed By: zdevito

Differential Revision: D20294979

Pulled By: Krovatkin

fbshipit-source-id: 2a68710ec8f8f854c99dfe173f49da442a39e498
2020-03-09 17:12:58 -07:00
2c1a302d6a [ROCm] Enable double __shfl_down (#34103)
Summary:
This allows us to enable some double-based pdist tests that previously ran into accumulated error from casting down to float.

Addresses https://github.com/pytorch/pytorch/issues/33128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34103

Differential Revision: D20343279

Pulled By: ezyang

fbshipit-source-id: a2da768259fab34ef326976283b7a15bebbbb979
2020-03-09 16:23:56 -07:00
0a4a558c2c Dictionary Constants (#32869)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32869

Differential Revision: D19909339

Pulled By: Krovatkin

fbshipit-source-id: 6fe2a9b470768f84b957c69cdf9af3a1bd9b1ca9
2020-03-09 16:12:36 -07:00
90ff3b56d0 Kill some unused TH(C)Storage functions. (#34385)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34385

Test Plan: Imported from OSS

Differential Revision: D20311064

Pulled By: gchanan

fbshipit-source-id: 6dc50621dc417e9ea4624cdebd0970453fa75a77
2020-03-09 16:03:56 -07:00
4e357089b4 Stop calling newWithSize directly. (#34384)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34384

Test Plan: Imported from OSS

Differential Revision: D20311057

Pulled By: gchanan

fbshipit-source-id: 1e1a1f9b757b62f20d8d806f21abdd70f07b12aa
2020-03-09 16:03:51 -07:00
fea618b524 [JIT] remove list with default builtin (#34171)
Summary:
I think this was added when we couldn't compile the function itself. Now we can.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34171

Differential Revision: D20269960

Pulled By: eellison

fbshipit-source-id: 0a60458d639995d9448789c249d405343881b304
2020-03-09 16:02:26 -07:00
34688d2c48 Add brand guidelines link (#34503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34503

Differential Revision: D20349273

Pulled By: soumith

fbshipit-source-id: 6b085377741ace5d200ca0d536de433b9bb7825c
2020-03-09 15:55:52 -07:00
2e7eef41ac [quant][graphmode] Swap quantized functional linear with aten::linear (#33853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33853

Quant fusion relies on inlining, but inlining will break the CallFunction("linear", ...) into an if block
that is hard to recognize and swap with quantized::linear. In order to
preserve the op, we swap all quantized functional linear calls to aten::linear.
They might produce different backward graphs, but this is called in the step before we get the quantized
model, so it shouldn't affect anything.
We'll integrate this with convert_script later in the new "finalize_quant" API.

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20343873

fbshipit-source-id: 423e03bf893b79267d2dc97bc997ee1bfe54ec0f
2020-03-09 15:45:20 -07:00
7688ca631a Enable RTTI for mobile builds, to enable custom class via torchbind in mobile (#34368)
Summary:
Custom classes via torchbind require runtime type information.
We are trying to enable custom-class-based graph rewrite for XNNPACK in
this stack of PRs: https://github.com/pytorch/pytorch/pull/34047.
They require RTTI to be enabled for mobile; mobile builds currently
fail without it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34368

Differential Revision: D20306155

Pulled By: kimishpatel

fbshipit-source-id: 52c61ff5467a619e8f51708a05258eee35dd0a56
2020-03-09 15:43:55 -07:00
2c0f3536b6 [jit] Make ModuleLists a sugared value (#34320)
Summary:
Previously, when emitting subscripts, we only emitted actual values, but
now a subscript may sometimes produce a `ModuleValue`, so it should stay a
`SugaredValue`. This allows the result of the subscript to be
treated as a real module (i.e. you can just do `self.modlist[1](inputs)`
instead of `self.modlist[1].forward(inputs)`).
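
A minimal sketch of what now works under `torch.jit.script` (the module definition is illustrative):

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.modlist = torch.nn.ModuleList([torch.nn.Linear(4, 4), torch.nn.ReLU()])

    def forward(self, x):
        # Subscripted modules can now be called directly:
        return self.modlist[1](self.modlist[0](x))

scripted = torch.jit.script(M())
print(scripted(torch.randn(2, 4)))
```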
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34320

Pulled By: driazati

Differential Revision: D20345642

fbshipit-source-id: 2bedf9a454af747b704422f6bbb8370cbdf4bf61
2020-03-09 15:36:46 -07:00
cyy
c218963270 fix more errors (#34480)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34480

Differential Revision: D20345198

Pulled By: ezyang

fbshipit-source-id: 583246acd02850ead96f1f0574d01ef6697c6352
2020-03-09 14:54:15 -07:00
15a7b9cf0a [RpcAgent] Metrics for current num active/async rpc calls. (#34398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34398

As part of PR 34109, it was suggested that we track the number of outstanding
async calls for RPC DebugInfo, particularly if we move towards using
at::launch() threads on occasion for continuations.

This particular aspect of the change was distinct from the main purpose of the
diff, and started getting bigger, so this functionality was split out as a separate diff.
For completeness, we track client_active_calls, server_active_calls,
server_active_async_calls, and write some very basic unittest coverage.
ghstack-source-id: 99708836

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/...

Differential Revision: D20314994

fbshipit-source-id: 2f7c75d5c511b27ed0c09c7b8a67b6fb49df31a5
2020-03-09 13:34:59 -07:00
8294db8f15 [iOS][CI] Remove org-member from iOS Simulator Builds (#34410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34410

### Summary

Currently, the iOS jobs are no longer being run on PRs. This is because all iOS jobs specified `org-member` as a context, which used to include all pytorch members, but it seems this rule has recently changed. It turns out that only users from the admin or builder groups have access to the context values. https://circleci.com/gh/organizations/pytorch/settings#contexts/2b885fc9-ef3a-4b86-8f5a-2e6e22bd0cfe

This PR removes `org-member` from the iOS simulator build, which doesn't require code signing. The arm64 builds will only be run on master, not on PRs anymore.

### Test plan

- The iOS simulator job should be able to appear in the PR workflow

Test Plan: Imported from OSS

Differential Revision: D20347270

Pulled By: xta0

fbshipit-source-id: 23f37d40160c237dc280e0e82f879c1d601f72ac
2020-03-09 13:22:54 -07:00
776d2a1e8f [quant][graphmode] Handling ops doesn't require observation in insertObservers (#33481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33481

We have to propagate the observed property of values through ops like max_pool2d and flatten, and
avoid inserting duplicate observers.
For example:
```
x1 = self.conv(x)
x2 = maxpool(x1)
x3 = self.conv(x2)
```
If x1 is observed, we should propagate this information through maxpool and
we should consider x2 as observed as well.

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20261897

fbshipit-source-id: 7de354a3ccb2b6e1708f5c743d4d9f7272691a93
2020-03-09 13:15:54 -07:00
2b45368e50 Fix cudnn 64bit indexing issue (#34407)
Summary:
Fix https://github.com/pytorch/pytorch/issues/33143
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34407

Differential Revision: D20325106

Pulled By: ngimel

fbshipit-source-id: 5aa52295f5491f189b7a8bea0987f28de0589d98
2020-03-09 12:35:55 -07:00
e025677e3c Remove **kwargs from torch.meshgrid (#34356)
Summary:
Changelog:
- Remove **kwargs from torch.meshgrid as they serve no purpose

Closes https://github.com/pytorch/pytorch/issues/34206
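
For reference, the remaining supported call pattern (a trivial sketch):

```
import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5])
gx, gy = torch.meshgrid(x, y)   # positional tensors only; no **kwargs
print(gx.shape, gy.shape)       # torch.Size([3, 2]) torch.Size([3, 2])
```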
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34356

Differential Revision: D20310971

Pulled By: zou3519

fbshipit-source-id: 97250051504aa3ec1e2a9af9296e7cc71872e5bf
2020-03-09 12:07:43 -07:00
70fe508c26 [pytorch] fix BUILD_CAFFE2_MOBILE gating around caffe2/operators/experimental/c10/cpu (#34354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34354

The condition `NOT INTERN_BUILD_MOBILE AND NOT BUILD_CAFFE2_MOBILE` was
added in #27086, but it seems it's always false on current master:

BUILD_CAFFE2_MOBILE is ON by default - the name is a little bit misleading -
it is ON even when it's building non-mobile PyTorch/Caffe2. It is OFF only
when it's building PyTorch mobile, where INTERN_BUILD_MOBILE is ON.

And when it's building PyTorch mobile, it won't build caffe2/operators
at all (by setting BUILD_CAFFE2_OPS OFF: https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt#L345)

So I imagine the real intention was to skip this when building Caffe2 mobile.
We can simply remove the deprecated BUILD_CAFFE2_MOBILE condition.

Test Plan: Imported from OSS

Differential Revision: D20345298

Pulled By: ljk53

fbshipit-source-id: d2cb4e2248fc209d63b2843e0f12e577e323def4
2020-03-09 12:00:57 -07:00
6d3783a6bc Clean up unused newWithSize variants. (#34383)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34383

Test Plan: Imported from OSS

Differential Revision: D20311065

Pulled By: gchanan

fbshipit-source-id: 9fc2cc4377f32c865401b04868a7405c49929c64
2020-03-09 11:19:30 -07:00
91e922a338 [AI Bench] Add support for NLU model
Summary: Add support for NLU-specific input

Test Plan:
tested

```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/assistant_mobile_inference.json --platform android/full_jit --framework pytorch --remote --devices  SM-G950U-7.0-24
```
make sure it is compatible with the previous test
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android/full_jit --framework pytorch --remote --devices  SM-G950U-7.0-24
```

```
{
  "model": {
    "category": "CNN",
    "description": "Assistant Mobile Inference",
    "files": {
      "model": {
        "filename": "model.pt1",
        "location": "//everstore/GICWmAB2Znbi_mAAAB0P51IPW8UrbllgAAAP/model.pt1",
        "md5": "c0f4b29c442bbaeb0007fb0ce513ccb3"
      },
      "data": {
        "filename": "input.txt",
        "location": "/home/pengxia/test/input.txt",
        "md5": "c0f4b29c442bbaeb0007fb0ce513ccb3"
      }
    },
    "format": "pytorch",
    "framework": "pytorch",
    "kind": "deployment",
    "name": "Assistant Mobile Inference"
  },
  "tests": [
    {
      "command": "{program} --model {files.model}  --input_dims \"1\" --input_type NLUType --warmup {warmup} --iter {iter} --input_file {files.data} --report_pep true",
      "identifier": "{ID}",
      "metric": "delay",
      "iter": 5,
      "warmup": 2,
      "log_output": true
    }
  ]
}

```
input.txt
```
what is weather today
what time it is
set a reminder for tomorrow
```

result
https://our.intern.facebook.com/intern/aibench/details/137241352201417

Reviewed By: kimishpatel

Differential Revision: D20300947

fbshipit-source-id: 7c1619541a2e9514a560a9acb9029cfc4669f37a
2020-03-09 10:39:49 -07:00
bcfd348858 [ONNX] Export new_zeros (#34077)
Summary:
ONNX export for new_zeros op added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34077

Reviewed By: hl475

Differential Revision: D20332074

Pulled By: houseroad

fbshipit-source-id: 4235c4f2c279c37aa8dde6d13c1b26f621967768
2020-03-09 10:38:22 -07:00
baeb359e7a Remove using namespace torch::autograd from header files (#34423)
Summary:
This PR prevents leaking symbols from `torch::autograd` namespace to the root namespace.
Fixes https://github.com/pytorch/pytorch/issues/34371.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34423

Differential Revision: D20338404

Pulled By: yf225

fbshipit-source-id: e7ff3348193667a0cee5d38f9a003ae36cc704ca
2020-03-09 10:31:21 -07:00
e3d50c4dda Retain the order of parameters while generating ConcreteModuleTypes (#34131)
Summary:
`ConcreteModuleTypeBuilder` used to keep parameters together with all other attributes in an `unordered_map`, which often reordered them while building up the type. Parameter order is semantically meaningful, so we need to preserve it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34131

Differential Revision: D20331542

Pulled By: suo

fbshipit-source-id: 5b860025f7902654d6099751d3fb14b12f6f5a67
2020-03-09 10:25:45 -07:00
f62a7e7efb Simplify implementation of newWithStorage1d. (#34382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34382

The previous implementation was handling both newWithStorage and newWithSize, which doesn't make much sense.

Test Plan: Imported from OSS

Differential Revision: D20311056

Pulled By: gchanan

fbshipit-source-id: 2696a4566e6203c98338c86cbf4c236bd18d7c49
2020-03-09 10:18:44 -07:00
b1bd950a4d Fixed stub for AdamW (#34299)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/33757](https://github.com/pytorch/pytorch/issues/33757)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34299

Differential Revision: D20337844

Pulled By: ezyang

fbshipit-source-id: 54bf174a09b8db9bf6e0c3c717730dd7c795d76b
2020-03-09 08:45:51 -07:00
739d4609c3 [C++ API] Fix ModuleList compile error: error: 'begin' was not declared in this scope (#34463)
Summary:
One example in the current docs for `torch::nn::ModuleList` doesn't compile, and this PR fixes it.
Fixes https://github.com/pytorch/pytorch/issues/32414.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34463

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20331120

Pulled By: yf225

fbshipit-source-id: 50bb078fe1a900c9114d5434e92dc40ee13b52bf
2020-03-09 08:15:50 -07:00
b09e90af1e Fix C++ at::Tensor docs generation (#34467)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25845.

**Test Plan:**
Check `pytorch_cpp_doc_push` CI job, and see if there is `classat_1_1_tensor` generated (similar to `structat_1_1native_1_1_convolution_descriptor`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34467

Differential Revision: D20338190

Pulled By: yf225

fbshipit-source-id: 52dc05af5e0d742e740de5576d0d2b3e17ef28dd
2020-03-09 08:04:32 -07:00
6e2bb1c054 End of the .data removal in torch/optim (#34211)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34211

Test Plan: Imported from OSS

Differential Revision: D20248684

Pulled By: albanD

fbshipit-source-id: 2294bfa41b82ff47f000bc98860780f59d7d4421
2020-03-09 06:40:39 -07:00
7e55494502 Warns on read-only Numpy array->tensor conversion (#33615)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/5442.

Per title (and see issue). A test is added to test_torch.py to verify the behavior.

Update (with new behavior):

NumPy arrays can be non-writeable (read-only). When converting a NumPy array to a Torch tensor the storage is shared, but the tensor is always writable (PyTorch doesn't have a read-only tensor). Thus, when a non-writeable NumPy array is converted to a PyTorch tensor it can be written to.

In the past, PyTorch would silently copy non-writeable NumPy arrays and then convert those copies into tensors. This behavior violates the from_numpy contract, however, which promises that the tensor and the array share memory.

This PR adds a warning message when a non-writeable NumPy array is converted into a Torch tensor. This will not break any networks, but will make end users aware of the behavior. They can work around the warning message by marking their NumPy arrays as writeable.
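
A minimal sketch of the new warning behavior:

```
import numpy as np
import torch

arr = np.zeros(3)
arr.flags.writeable = False

t = torch.from_numpy(arr)   # now warns about the non-writeable array
```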
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33615

Differential Revision: D20289894

Pulled By: mruberry

fbshipit-source-id: b76df0077399eb91038b12a6bf1917ef38c2cafd
2020-03-08 20:03:50 -07:00
79d47c1c5f Fix the missing ';' in Conv.cpp (#34448)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34415.
BTW, isn't this tested on CI? Maybe we need to introduce some tests with legacy versions of cuDNN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34448

Differential Revision: D20325104

Pulled By: ngimel

fbshipit-source-id: f03dec30ffa6e50a28ee8103d7d49cd6fc0a6d69
2020-03-07 21:43:18 -08:00
7d9f611b64 Add worker_name helper to dist_utils. (#34162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34162

This avoids the "worker{}".format(..) in our unit tests to something
cleaner.
ghstack-source-id: 99713074

Test Plan: waitforbuildbot

Differential Revision: D20233533

fbshipit-source-id: 5cff952ca68af5a6d26dc5cc01463cf7756d83d9
2020-03-07 13:24:45 -08:00
8a17dc65af [quantization] Make FP16 RNN use new prepack op (#34339)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34339

Test Plan: Imported from OSS

Differential Revision: D20297194

Pulled By: jamesr66a

fbshipit-source-id: 8bf6d0f2cb047e90bbdd184aaad337b143040d10
2020-03-07 10:04:01 -08:00
45a504dd2d [JIT] Introduce BuiltinOpFunction and integrate into torchbind (#34098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34098

* #33900 [JIT] Move stuff out of class_type.cpp

Test Plan: Imported from OSS

Differential Revision: D20229166

Pulled By: jamesr66a

fbshipit-source-id: d658a63a5d6e372e675f35b8456adc8de82b49f3
2020-03-07 10:03:56 -08:00
60e8615a6d [JIT] Virtualize Function (#33921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33921

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.intern.facebook.com/intern/diff/D20153092/)!

Test Plan: Imported from OSS

Differential Revision: D20177227

Pulled By: jamesr66a

fbshipit-source-id: 87f3e484c4f873d60f76f50f6789c1b4a73bdfde
2020-03-07 10:03:50 -08:00
bb1114258c [JIT] Move stuff out of class_type.cpp (#33900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33900

These functions don't require any libtorch-specific functionality, so move them into the header so they're included in the ATen build

Test Plan: Imported from OSS

Differential Revision: D20175874

Pulled By: jamesr66a

fbshipit-source-id: 1efab1b60e196a635e6c6afadb042b63771170f0
2020-03-07 10:02:32 -08:00
65bad41cbe Fixed typos in quantization docs / docstrings (#34182)
Summary:
Removed extra back quote character.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34182

Differential Revision: D20320146

Pulled By: jerryzh168

fbshipit-source-id: 33c347711a052cc55f7d1a41ed959dadf99a3d7d
2020-03-06 21:53:52 -08:00
c5e822b7bb Back out "[jit] Add type tags to lists/dicts in pickle" (#34406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34405

Original commit changeset: 2f1826e6679a

Test Plan: reverting, see S197156

Reviewed By: akyrola, volkhin

Differential Revision: D20317456

fbshipit-source-id: 89298a9c022edba1d54bcdc7541804cb919e33f5
2020-03-06 20:02:16 -08:00
392afb9f8b Fix overlapping keywords (#34142)
Summary:
This commit fixes overlapping keywords in the CPP Docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34142

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20319949

Pulled By: yf225

fbshipit-source-id: e7bb2efdc286c85792c6f18a260c3bba33c54008
2020-03-06 19:16:21 -08:00
b0479506a8 Add the 3d avg pool for video related model (#33339)
Summary:
```
import torch, time

for dtype in [torch.qint8, torch.quint8, torch.qint32]:
    print('****', str(dtype), '*****')
    x = torch.rand(1, 5, 56, 56, 256)

    q_x = torch.quantize_per_tensor(x, 0.5, 1, dtype)
    q_x = q_x.permute([0, 4, 1, 2, 3])

    x = x.permute([0, 4, 1, 2, 3])

    NITER = 10

    s = time.time()
    for i in range(NITER):
        float_out = torch.nn.functional.avg_pool3d(x, kernel_size=3, stride=None, padding=0)
    time_per_iter_float = (time.time() - s) / NITER

    s = time.time()
    for i in range(NITER):
        quant_out = torch.nn.quantized.functional.avg_pool3d(q_x, kernel_size=3, stride=None, padding=0)
    time_per_iter_quant = (time.time() - s) / NITER
    print('time/iter ms (float)', 'time/iter ms (quant)', 'quant/float', sep='\t')
    print(time_per_iter_float * 1000, time_per_iter_quant * 1000, time_per_iter_quant / time_per_iter_float, sep='\t')
```

```
**** torch.qint8 *****
time/iter ms (float)  time/iter ms (quant)  quant/float
16.286182403564453  0.7308721542358398  0.04487682479080417
**** torch.quint8 *****
time/iter ms (float)  time/iter ms (quant)  quant/float
15.364313125610352  0.6497383117675781  0.042288796541418254
**** torch.qint32 *****
time/iter ms (float)  time/iter ms (quant)  quant/float
15.649032592773438  13.879132270812988  0.8869003363966556
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33339

Differential Revision: D19900904

Pulled By: lly-zero-one

fbshipit-source-id: 4522cc6b4a0751aeda6c7edc258e0cb3f55a8fe3
2020-03-06 17:44:34 -08:00
d98516026e [PyTorch BC] Clean up the BC whitelist (#34393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34393

Clean up the list

Test Plan: CI

Reviewed By: hl475

Differential Revision: D20300530

fbshipit-source-id: 50e7da0a9f8295eff33590982f32f84abee96d9c
2020-03-06 16:10:20 -08:00
ccf6fab65e Fix doc and type hints for "torch.add"; fix deprecated python calls in tests (#33935)
Summary:
This PR fixes the documentation for `torch.add` with `alpha`. It also fixes deprecated Python calls to `torch.add` and `torch.addmm` in tests, which may affect performance in *test/test_sparse.py* and *test/test_nn.py*.
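
For context, the deprecated and current call forms look roughly like this (a
sketch; these are not the PR's exact call sites):
```python
import torch

a, b = torch.randn(3), torch.randn(3)
m, m1, m2 = torch.randn(2, 2), torch.randn(2, 3), torch.randn(3, 2)

# Deprecated: alpha/beta passed positionally.
# torch.add(a, 2, b)
# torch.addmm(0.5, m, 2, m1, m2)

# Current: alpha/beta passed as keyword arguments.
torch.add(a, b, alpha=2)              # a + 2 * b
torch.addmm(m, m1, m2, beta=0.5, alpha=2)
```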

cc csarofeen ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33935

Differential Revision: D20313320

Pulled By: ngimel

fbshipit-source-id: fb08413d7e244865952e3fc0e1be7f1794ce4e9a
2020-03-06 15:53:58 -08:00
01edb7450f [Lite Trainer] Add necessary registrations for MNIST model (#33717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33717

Because of the special treatment of operator names for the lite interpreter, all the operators used in the lite interpreter are still prepended with "_". Add the necessary registrations for the MNIST model. All the ops with autograd capability are included in torch_mobile_train. After rebasing, the selective build from D19649074 can be utilized to strip the unused ops.

Note that this diff is for a feasibility test. Training accuracy is not covered by the test.
ghstack-source-id: 97780066

Test Plan:
```
buck run xplat/caffe2/fb/lite_trainer:lite_trainer -c pt.disable_gen_tracing=1 -c pt.static_dispatch=0 -- --model=/path/MnistModel.bc
```
{F227898221}

Reviewed By: dreiss

Differential Revision: D19743201

fbshipit-source-id: cacadd76f3729faa0018d147a69466bbf54312fd
2020-03-06 15:49:03 -08:00
96ca06cfce Add nhwc memory format test for dropout (#34379)
Summary:
cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34379

Differential Revision: D20310118

Pulled By: ngimel

fbshipit-source-id: a9bafd6b8fbcb57443e22181cf6bd9879b6f6051
2020-03-06 15:43:21 -08:00
37dfc6c498 Reenable large conv tests (#34259)
Summary:
Please merge after https://github.com/pytorch/pytorch/pull/33073

With that PR, we now try different algorithms on OOM, so hopefully some algorithm will work at low memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34259

Differential Revision: D20310094

Pulled By: ngimel

fbshipit-source-id: bccd8162bd06a0e54ac6f42a7fd9a5b766f92cd7
2020-03-06 15:36:54 -08:00
516a587438 Enhance reproducibility documentation (#33795)
Summary:
Improves explanation of non-determinism when running on GPUs. Adds info about `torch.nn.BCELoss` operating non-deterministically on GPUs.
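
For illustration, the usual recipe for reducing GPU non-determinism looks
like this (a sketch; it does not cover ops such as `torch.nn.BCELoss` that
remain non-deterministic on CUDA):
```python
import torch

torch.manual_seed(0)                       # seeds the CPU (and CUDA) RNGs
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN algorithms
torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning
```
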
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33795

Differential Revision: D20284880

Pulled By: ngimel

fbshipit-source-id: d543959636d261a80c234150304344b19a37ba5d
2020-03-06 15:32:04 -08:00
079de7f376 .circleci: Remove macOS builds related to CUDA (#34333)
Summary:
We don't release binaries for macOS with CUDA support, so we should just
remove those builds from our regular PR pipeline

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34333

Differential Revision: D20312565

Pulled By: seemethere

fbshipit-source-id: 376228680aa0e814d1b37f1ff63b7d1262515e44
2020-03-06 13:18:06 -08:00
2d3f6cbf03 .circleci: Update default smoke tests from cuda 10.0 -> 10.2 (#34328)
Summary:
Now that https://github.com/pytorch/pytorch/issues/34241 is merged, we can update these to the latest cuda version to get a better signal.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34328

Differential Revision: D20312552

Pulled By: seemethere

fbshipit-source-id: 8e6bf797e067500d5dd9a607c6c19465028637bc
2020-03-06 13:11:58 -08:00
5608ffc46c [PyTorch] Remove const modifiers from passed by value integers in qbatch_norm_fn (#34378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34378

This fixes a strange symbol-mangling mismatch between `DECLARE_DISPATCH(qbatch_norm_fn, qbatch_norm_stub)` and `REGISTER_DISPATCH(qbatch_norm_stub, &q_batch_norm_kernel<false>);` when the code is built on Windows with clang

Test Plan: CI + build PyTorch on Windows using clang

Reviewed By: EscapeZero

Differential Revision: D20309550

fbshipit-source-id: e97c7c3b6fee2e41ea6b2f8167ce197aec404e3d
2020-03-06 13:04:54 -08:00
c6ea71b6e8 Fix Conv.cpp, &&= is not a C++ operator (#34381)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34381

Differential Revision: D20310674

Pulled By: ngimel

fbshipit-source-id: a453c1d07bcf7aead7402f091bccb4af7b1ec690
2020-03-06 12:38:58 -08:00
5f641f93f1 [aten] Don't deadlock in IValue::Future impl, tests. (#34099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34099

This change effectively applies into IValue's future impl a few fixes
we discovered when using the torch::utils::Future<T> impl.

The parallel impls should probably eventually be merged, but until then:

  - Don't hold the lock when invoking the callbacks. Holding it made
    it effectively impossible (it deadlocked) to call value() to get
    the value from inside a callback (see the Python sketch below).

  - We discovered that it was slightly cleaner in practice to
    notify condition variables prior to invoking callbacks
    (best to unblock paused threads ASAP, before spawning new work).

  - Fix some var naming inconsistency.
  - Add some caffe2 cpp test coverage.
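
A toy Python sketch of this locking discipline (illustrative only, not the
actual C++ implementation):
```python
import threading

class MiniFuture:
    """Toy illustration of the locking discipline described above."""

    def __init__(self):
        self._cv = threading.Condition()
        self._done = False
        self._value = None
        self._callbacks = []

    def value(self):
        # Safe to call even from inside a callback, because set_value
        # runs callbacks without holding the lock.
        with self._cv:
            while not self._done:
                self._cv.wait()
            return self._value

    def add_callback(self, cb):
        with self._cv:
            if not self._done:
                self._callbacks.append(cb)
                return
        cb(self._value)  # already completed: run immediately, lock released

    def set_value(self, value):
        with self._cv:
            self._value = value
            self._done = True
            callbacks, self._callbacks = self._callbacks, []
            self._cv.notify_all()  # unblock waiters before spawning new work
        for cb in callbacks:       # invoke callbacks outside the lock
            cb(value)
```
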
ghstack-source-id: 99336569

Test Plan:
```
buck test mode/dev //caffe2/test/cpp/jit:jit -- 'JitTest\.IValueFuture'

```

Differential Revision: D20203278

fbshipit-source-id: 6e805ba547899dab9aab458e4b23049db31f930e
2020-03-06 12:34:50 -08:00
0489b8da42 Add scripts to promote S3 artifacts from test channels to stable channels (#34274)
Summary:
Currently testing against the older release `1.4.0` with:
```
PYTORCH_S3_FROM=nightly TEST_WITHOUT_GIT_TAG=1 TEST_PYTORCH_PROMOTE_VERSION=1.4.0 scripts/release/promote/libtorch_to_s3.sh
PYTORCH_S3_FROM=nightly TEST_WITHOUT_GIT_TAG=1 TEST_PYTORCH_PROMOTE_VERSION=1.4.0 scripts/release/promote/wheel_to_s3.sh
```

These scripts can also be used for `torchvision` as well which may make the release process better there as well.

Later on this should be made into a re-usable module that can be downloaded from anywhere and used amongst all pytorch repositories.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34274

Test Plan: sandcastle_will_deliver

Differential Revision: D20294419

Pulled By: seemethere

fbshipit-source-id: c8c31b5c42af5096f09275166ac43d45a459d25c
2020-03-06 12:18:16 -08:00
879a90b322 [ModelLoading] Use byte encoding for uint8, fp16 etc. instead of int32 (#34343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34343

Use byte encoding for uint8, fp16 etc. instead of int32 in TensorProto serialization/deserialization

tl;dr
- fp16 tensor deserialization 12x faster, serialized size 25% lower
- uint8 tensor deserialization 36x faster, serialized size 25% lower
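
A rough NumPy sketch of the difference between the two encodings
(illustrative only, not the actual protobuf code):
```python
import numpy as np

x = np.random.rand(4).astype(np.float16)

# Old: widen each 2-byte value into an int32 field of the TensorProto.
int32_encoding = x.view(np.uint16).astype(np.int32)

# New: store the raw little-endian bytes directly.
byte_encoding = x.tobytes()

# Round-tripping the byte encoding recovers the original values exactly.
assert np.array_equal(np.frombuffer(byte_encoding, dtype=np.float16), x)
```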

Test Plan:
```
============================================================================
caffe2/caffe2/fb/predictor/ModelLoaderBenchmark.cpprelative  time/iter  iters/s
============================================================================
BlobProtoInt32DeserializationFloat16                        12.37ms    80.82
BlobProtoByteDeserializationFloat16             1125.46%     1.10ms   909.64
----------------------------------------------------------------------------
BlobProtoInt32DeserializationUInt8                          17.57ms    56.92
BlobProtoByteDeserializationUInt8               3629.45%   484.02us    2.07K
============================================================================
```

Reviewed By: yinghai

Differential Revision: D20137451

fbshipit-source-id: 8ed4be2286a6d4c7e134fcb0832f22bc645039a1
2020-03-06 11:58:30 -08:00
98afce3c56 Remove unnecessary assert in autograd engine (#34307)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34307

Test Plan: Imported from OSS

Differential Revision: D20283401

Pulled By: albanD

fbshipit-source-id: 34f6eb8955b7d9cb259260abc1056ddd9f354107
2020-03-06 11:45:46 -08:00
6d8a0f6731 [Aten] Init container iterators to an unsigned type (#34159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34159

This fixes `comparison of integers of different sign` warnings

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D20232085

fbshipit-source-id: 8f325be54395be54c704335cb7edf2ec7ef75e75
2020-03-06 10:35:43 -08:00
4c99351de6 [AMD] Remove num_gpu check for remote execution (#34318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34318

Stop checking whether we have AMD GPU devices on the host, because we may be constructing a net on a machine without a GPU and running the net on another one with a GPU

Reviewed By: ajauhri

Differential Revision: D20269562

fbshipit-source-id: 1f561086cacdcead3ce7c03c2d02c25336c8b11a
2020-03-06 09:53:57 -08:00
4872b126fd [aten] remove stmt unreachable, variable never used warnings (#34017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34017

Remove warning
```
caffe2/aten/src/THC/generic/THCTensorMathBlas.cu(437): warning: statement is unreachable
caffe2/aten/src/THC/generic/THCTensorMathBlas.cu(271): warning: variable "transpose_m1" was set but never used
caffe2/aten/src/THC/generic/THCTensorMathBlas.cu(271): warning: variable "transpose_m2" was set but never used
```

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D20181179

fbshipit-source-id: 3665912ba55bffbd8b4555f8a6803e57a502c103
2020-03-06 09:52:43 -08:00
82a177c07f [c10] remove warning attribute does not apply to any entity (#34018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34018

Remove warning
```
caffe2/c10/util/ArrayRef.h(278): warning: attribute does not apply to any entity
```

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20181191

fbshipit-source-id: 58bd168a87a94fec925c7cde8b8d728a4257446c
2020-03-06 09:47:10 -08:00
17ceb6941f [RPC] Create local RRef<ModuleInterface> remotely in Python, use it remotely in TorchScript (#34183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34183

https://github.com/pytorch/pytorch/pull/33263 enhanced the RRef Python constructor to infer most types, by `jit::tryToInferType(..)`.

But this helper function can't infer the `ScriptModule` type due to `ScriptModule`'s special per-Module type singleton logic, so it's still not possible for a Python-created RRef to know the JIT type of its contained `ScriptModule`.

Instead of inferring the specific type of a Module, which could lead to too many candidate types (due to a Module's multiple-inheritance possibility), it's more straightforward to set its type as a user-specified `ModuleInterface` type.

We added an optional argument `type_hint` for users to mark what `ModuleInterface` type an `RRef` holds.
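
A hedged sketch of the intended usage; `MyModuleInterface` and
`MyScriptModule` are illustrative names, and the call assumes an initialized
RPC agent:
```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.interface
class MyModuleInterface(torch.nn.Module):
    def forward(self, inp: torch.Tensor) -> torch.Tensor:
        pass

# Hypothetical call site: type_hint tells TorchScript what the locally
# created RRef holds.
# rref = rpc.RRef(MyScriptModule(), type_hint=MyModuleInterface)
```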

ghstack-source-id: 99649379

(Note: this ignores all push blocking failures!)

Test Plan:
Aspects that need to be confirmed in the test cases

https://fb.quip.com/aGxRAh2lCg05

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_create_local_script_class_rref

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_create_local_script_module_rref

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_return_local_script_module_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_torchscript_function_exception
```

Differential Revision: D7065050

fbshipit-source-id: e10210c0996622969e499e4a35b0659b36787c1c
2020-03-06 08:28:22 -08:00
a7da4490cc Clean up some legacy scalar/empty handling. (#34217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34217

LegacyNoScalar variants cause 0-dim tensors to behave like 1-dim tensors.
LegacyAll variants cause 0-dim tensors to behave like 1-dim tensors, and numel == 0 tensors to be treated like 0-dimensional tensors.

Since this was done by codemod, these are often unneeded and often translated incorrectly to ATen.

Test Plan: Imported from OSS

Differential Revision: D20249577

Pulled By: gchanan

fbshipit-source-id: 6f2876d3e479562c9323f3629357a73a47869150
2020-03-06 08:13:31 -08:00
9c5578fd0a Make sure Vec256 int32_t and int16_t loadu temporary arrays are properly initialized (#34281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34281

Seems like #32722 missed two loadu functions

Test Plan: Imported from OSS

Differential Revision: D20287731

Pulled By: albanD

fbshipit-source-id: d959b2508de3f9f660368152d7260026d7fbccbe
2020-03-06 07:55:45 -08:00
35b6d2945d Tensor.random_ check that from and to are in tensor dtype bounds (#34033)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34033
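
A short illustration of the assumed behavior of the new check (the exact
error message may differ):
```python
import torch

t = torch.empty(4, dtype=torch.uint8)
t.random_(0, 256)   # fine: the half-open range [0, 256) fits uint8

# Now raises instead of silently overflowing, since -10 is below
# uint8's minimum representable value:
# t.random_(-10, 10)
```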

Test Plan: Imported from OSS

Differential Revision: D20182414

Pulled By: pbelevich

fbshipit-source-id: 3704570ead7de169ce13c81164be0aff0806fb46
2020-03-06 07:22:47 -08:00
30680196e4 Revert D20121915: [JIT] Add support for list()
Test Plan: revert-hammer

Differential Revision:
D20121915

Original commit changeset: c6c4ef444dbf

fbshipit-source-id: 829adb58780f4d0f41acebb3e7640a9c68bdbc1b
2020-03-06 07:16:40 -08:00
f9f135c5d8 ChannelsLast3d support is_contiguous, contiguous, suggest_memory_format, caching (#33033)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33033

Test Plan: Imported from OSS

Differential Revision: D19759661

Pulled By: glaringlee

fbshipit-source-id: 6c4798fa93589338c0c71c5308b9fd1151330245
2020-03-06 06:02:03 -08:00
415595ace4 [C++ API] Remove init-list form of at::indexing::Slice (#34255)
Summary:
The init-list form of `at::indexing::Slice` (i.e. `tensor.index({{1, None, 2}, ...})` instead of `tensor.index({Slice(1, None, 2), ...})`) in the C++ API can be easily confused with the list-form indexing in the Python API (e.g. `tensor[[1, 3, 2], ...]`), which is not good from a readability perspective. This PR removes the init-list form of `at::indexing::Slice` to make the API less confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34255

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D20290166

Pulled By: yf225

fbshipit-source-id: abbcbeca0b179219e5e1f196a33ef8aec87ebb76
2020-03-06 05:51:53 -08:00
b8fd88319a C++ make torch::nn::Sequential push_back(AnyModule) methods public (#34208)
Summary:
Issue https://github.com/pytorch/pytorch/issues/33192
Moves Sequential::push_back methods with AnyModule from private -> public
Allows adding an existing AnyModule via something like:

```
  torch::nn::Sequential q;
  auto a=torch::nn::AnyModule(torch::nn::Linear(1,2));
  q->push_back(a);
  q->push_back("fc",a);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34208

Differential Revision: D20300278

Pulled By: yf225

fbshipit-source-id: 4525319bb7fb6667e43a006c9f446a2193781005
2020-03-06 05:47:14 -08:00
9a5e9d8cec [pytorch][mobile] change mobile build scripts to build PyTorch by default (#34203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34203

Currently the cmake and mobile build scripts still build libcaffe2 by
default. To build PyTorch mobile, users have to set the environment variable
BUILD_PYTORCH_MOBILE=1 or set the cmake option BUILD_CAFFE2_MOBILE=OFF.

PyTorch mobile has been released for a while. It's about time to change
CMake and the build scripts to build libtorch by default.

Changed the caffe2 CI job to build libcaffe2 by setting the
BUILD_CAFFE2_MOBILE=1 environment variable. Only found Android CI for
libcaffe2 - do we even have iOS CI for libcaffe2?

Test Plan: Imported from OSS

Differential Revision: D20267274

Pulled By: ljk53

fbshipit-source-id: 9d997032a599c874d62fbcfc4f5d4fbf8323a12e
2020-03-05 23:40:47 -08:00
b50825e011 Make RecordFunction more robust for async use cases (#34122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34122

Earlier work added support for async RPC cases where RecordFunction's
end callbacks might be called in a different thread; in addition, some
extra care was needed to handle the pointer to the parent function.

This PR makes RecordFunction aware of potentially multiple threads in
use, removes the unused parent() call, and restricts the current()
RecordFunction to scope-based record functions (the RECORD_FUNCTION macro)

Test Plan: unit tests

Differential Revision: D20297709

Pulled By: ilia-cher

fbshipit-source-id: 46a59e1b2eea0bbd8a59630385e193b38d30f9d1
2020-03-05 22:28:53 -08:00
38857734f0 [JIT] fix py35 test (#34350)
Summary:
test_module_interfaces was using syntax only supported in Python >= 3.6
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34350

Reviewed By: mrshenli

Differential Revision: D20298869

Pulled By: eellison

fbshipit-source-id: 22319ca403113cff2eedf57767bb34d9580e6db3
2020-03-05 21:31:19 -08:00
76035f050b [C++ API Parity] Adam: updated step and class design (#33730)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33730

Differential Revision: D20292073

Pulled By: anjali411

fbshipit-source-id: a7b4a70f29027ab355aebb91873ea55d5cb51783
2020-03-05 19:15:24 -08:00
f4da78f1b3 Remove RPC TorchScript private API (#33978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33978

We can directly pass a user callable to the rpc_async API in TorchScript. There is no need to have a private API that takes a qualified name.
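
A sketch of the resulting user-facing call (illustrative names; assumes an
initialized RPC agent):
```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def my_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a + b

# Pass the callable itself rather than its qualified name string:
# fut = rpc.rpc_async("worker1", my_add, args=(torch.ones(2), torch.ones(2)))
# result = fut.wait()
```
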
ghstack-source-id: 99600360

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_torchscript_functions_not_supported
```

Differential Revision: D7420993

fbshipit-source-id: 228c15b21848e67418fab780e3fd6a1c6da5142d
2020-03-05 18:35:05 -08:00
02478984d6 Add support to dump unsupported ops. Add lite_interpter_load test. (#34278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34278

This diff helps check all the ops not supported by lite_interpreter.
Helpful mainly to find all the ops that need to be added instead of adding them
one by one.

Test Plan:
buck run caffe2/binaries:lite_interpreter_model_load --
--model=<bytecode-model-path>

Reviewed By: iseeyuan

Differential Revision: D20266341

fbshipit-source-id: 5a6c7a5bc52f910cea82a72045870da8105ccb87
2020-03-05 18:31:31 -08:00
434af5d94a [quant] Speed up per-channel min-max observer (#34118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34118

Previously calc_per_channel_qparams used for loops and Python primitives, which called `item` many times, causing a slowdown during training.
These changes use torch primitives on the tensor to speed up the operation over 60x.

Perf results on MobileNetV2 during training, using the autograd profiler:

FP32 forward call -
Self CPU time total: 47.222ms
CUDA time total: 124.001ms

before change
FakeQuant Model -
Self CPU time total: 19.107s
CUDA time total: 27.177s

after change
FakeQuant Model -
Self CPU time total: 404.667ms
CUDA time total: 446.344ms
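
A minimal sketch of the kind of change involved (hypothetical helper names;
the real observer code differs):
```python
import torch

def per_channel_min_max_loop(x):
    # Before: Python loop with many .item() calls per training step.
    mins, maxs = [], []
    for c in range(x.size(0)):
        mins.append(x[c].min().item())
        maxs.append(x[c].max().item())
    return mins, maxs

def per_channel_min_max_vectorized(x):
    # After: a single pair of tensor ops, no per-channel Python overhead.
    flat = x.reshape(x.size(0), -1)
    return flat.min(dim=1).values, flat.max(dim=1).values
```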

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D20287841

fbshipit-source-id: 6b706b8206e0d0da3c3c217b014e8da5b71b870d
2020-03-05 18:29:41 -08:00
d2b5eb2a45 [ONNX] Fix for random generators export (#33789)
Summary:
Export random generator with dynamic input size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33789

Reviewed By: hl475

Differential Revision: D20121175

Pulled By: houseroad

fbshipit-source-id: c16d11eb07678166d125759d97aadfcd7c80ef14
2020-03-05 17:58:54 -08:00
89d314b5d5 [pytorch] update mobile docker image version (#34337)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34337

Test Plan: Imported from OSS

Differential Revision: D20296975

Pulled By: ljk53

fbshipit-source-id: bc4a39689dca22e4530f25225f1884eda9bc74de
2020-03-05 17:47:36 -08:00
1cf12b7e53 [quant] Fix histogram observer to work with QAT on GPU (#34232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34232

By default `torch.zeros` creates the tensor on CPU. We need to specify the device argument to get it to work correctly on GPU during QAT.
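
A minimal illustration of the assumed shape of the fix (not the exact
observer code):
```python
import torch

x = torch.randn(8, device="cuda" if torch.cuda.is_available() else "cpu")

# Before: allocated on CPU regardless of where the observed tensor lives.
hist = torch.zeros(256)

# After: allocate on the observed tensor's device so QAT on GPU works.
hist = torch.zeros(256, device=x.device)
```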

Test Plan:
1. Tested by running QAT on GPU

2. python test/test_quantization.py

Imported from OSS

Differential Revision: D20286351

fbshipit-source-id: 745723c85d902870c56c1c7492f26cb027ae9dc6
2020-03-05 17:19:12 -08:00
e4a883e601 cuDNN convolution try multiple algo (#33073)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/31336 https://github.com/pytorch/pytorch/issues/1664

Sometimes cuDNN heuristics return algorithms that cannot be used. Instead of just using the first algorithm returned, we should try these algorithms one by one until one of them succeeds.

Benchmark:
https://github.com/zasdfgbnm/things/blob/master/2020Q1/conv-benchmark.ipynb
```python
i = torch.randn(256, 3, 256, 256).cuda()
c = torch.nn.Conv2d(3, 3, 3, 3).cuda()

%timeit c(i); torch.cuda.synchronize()
```
before vs after = 498 vs 490 µs

The performance is improved, I guess, because before this PR we always called the heuristics to get the algorithm, whereas after this PR we only do so the first time.
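
The strategy, sketched in Python (illustrative only; the real logic lives in
the C++ cuDNN bindings):
```python
def run_conv_with_fallback(algos, run):
    # Try each algorithm suggested by the heuristics until one succeeds,
    # instead of failing outright when the first one cannot be used.
    last_err = RuntimeError("no algorithms to try")
    for algo in algos:
        try:
            return run(algo)
        except RuntimeError as err:  # e.g. a cuDNN out-of-memory failure
            last_err = err
    raise last_err
```
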
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33073

Differential Revision: D20284755

Pulled By: ngimel

fbshipit-source-id: b03af37c75939ca50c2cb401c706ba26914dd10e
2020-03-05 17:06:21 -08:00
5500c3de0a Revert D20150304: [pytorch][PR] [JIT] Introduce a fake Tensor creation node for IR unit tests
Test Plan: revert-hammer

Differential Revision:
D20150304

Original commit changeset: c88f5289055a

fbshipit-source-id: 14ac0e46145e9fb4f200c6318b63edd541380aeb
2020-03-05 16:25:08 -08:00
78aebbcb88 [JIT] add other module apis (#34106)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34106

Test Plan: Imported from OSS

Differential Revision: D20283996

Pulled By: eellison

fbshipit-source-id: 88e7bc4547e96717d6c8efe0b25ede0d198d9e68
2020-03-05 16:12:29 -08:00
2af64ba3ed Allow output to zero-strided tensors if the size is <= 1 along that dim (#34100)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34100

Differential Revision: D20267778

Pulled By: ngimel

fbshipit-source-id: 1b84c4f6e6bf5d29c3698daa3cb71554b25c1eee
2020-03-05 16:01:33 -08:00
ccf4d69b75 [Lite Interpreter] Enable __setstate__ (#33294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33294

1. Serialize bytecode of __setstate__ and run it when loading the model.
2. One use case is quantization. To test this use case, a few operators are registered temporarily for the lite interpreter. The "_"-prefixed registrations will be removed when the operators are all migrated to mobile.

Test Plan: Imported from OSS

Differential Revision: D20162898

Pulled By: iseeyuan

fbshipit-source-id: 7a3180807bf38fbce594d86993896861f12bb58c
2020-03-05 15:24:21 -08:00
765c5b1c95 .circleci: Add CUDA 10.2 to CI (#34241)
Summary:
Basically a re-do of https://github.com/pytorch/pytorch/pull/33471

Should be safe to merge now that https://github.com/pytorch/pytorch/issues/34135 has been merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34241

Differential Revision: D20292711

Pulled By: seemethere

fbshipit-source-id: c508b5ef58f52aa3a263fd33b0373f31719fa0a4
2020-03-05 15:06:34 -08:00
f218842f2e [JIT] Add support for list() (#33818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33818

Test Plan: Imported from OSS

Differential Revision: D20121915

Pulled By: eellison

fbshipit-source-id: c6c4ef444dbf1d4134dccb28c13315e225945b64
2020-03-05 14:48:20 -08:00
479c3b0aa5 [JIT] add support for torch.norm (#33783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33783

Fix for https://github.com/pytorch/pytorch/issues/20113

Test Plan: Imported from OSS

Differential Revision: D20121917

Pulled By: eellison

fbshipit-source-id: ffedcc40678cd80f5529ff9323088eed544e5158
2020-03-05 14:46:24 -08:00
beb4309406 [ONNX] Reduce ONNX test time on CI (#33242)
Summary:
Among all ONNX tests, ONNXRuntime tests are taking the most time on CI (almost 60%).
This is because we are testing larger models (mainly torchvision RCNNs) for multiple ONNX opsets.
I decided to divide the tests between two jobs for older/newer opsets. This reduces the test time from 2h to around 1h10min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33242

Reviewed By: hl475

Differential Revision: D19866498

Pulled By: houseroad

fbshipit-source-id: 446c1fe659e85f5aef30efc5c4549144fcb5778c
2020-03-05 14:38:34 -08:00
ff2731b45c Revert "Disable MNIST test in test_xla() (#34261)" (#34316)
Summary:
Should be passing now ;)
This reverts commit 4a194f89aadc7cd1d7e24622b53855cfb885da75.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34316

Reviewed By: mrshenli

Differential Revision: D20287196

Pulled By: ailzhang

fbshipit-source-id: 1cc48a11edcc48a0ec4161c94487912eba63c9a5
2020-03-05 14:27:26 -08:00
9651088228 Tuck the packing logic into Int8FCPackWeight op (#34289)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34289

Test Plan:
```
 buck test caffe2/caffe2/quantization/server:fully_connected_dnnlowp_op_test
```

Reviewed By: csummersea

Differential Revision: D20275538

fbshipit-source-id: 699ca2a145c7c9a50b0fdab7bd68d8557a031ac0
2020-03-05 13:43:08 -08:00
9ce833879f [JIT] Introduce a fake Tensor creation node for IR unit tests (#33914)
Summary:
**Summary**
There is often a need to create a Tensor when writing IR by hand for JIT
optimisation pass unit tests. The only options for this today are real
Tensor creation functions like `aten::ones`. Any test that uses these functions
must also use the same default arguments as the Python/C++ API, which means
that all of the tests have to be updated when the API is updated. This commit
introduces a new primitive, `prim::MakeTestTensor` with schema `() -> Tensor` that
should be used in unit tests instead of real Tensor creation functions. This new
primitive has no public-facing API, so the maintenance burden is much lower.

**Testing**
This commit updates the alias analysis and DCE tests to use `prim::MakeTestTensor` instead of
`aten::rand`, `aten::ones`, and `aten::zeros`.

```
$ ./bin/test_jit
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = *-*_CUDA:*_MultiCUDA
[==========] Running 75 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 75 tests from JitTest
[ RUN      ] JitTest.ADFormulas
[       OK ] JitTest.ADFormulas (82 ms)
[ RUN      ] JitTest.Attributes
[       OK ] JitTest.Attributes (0 ms)
...
...
...
[ RUN      ] JitTest.LiteInterpreterPrim
[       OK ] JitTest.LiteInterpreterPrim (0 ms)
[ RUN      ] JitTest.LiteInterpreterLoadOrigJit
[       OK ] JitTest.LiteInterpreterLoadOrigJit (2 ms)
[----------] 75 tests from JitTest (150 ms total)

[----------] Global test environment tear-down
[==========] 75 tests from 1 test case ran. (150 ms total)
[  PASSED  ] 75 tests.
```

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33500.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33914

Differential Revision: D20150304

Pulled By: SplitInfinity

fbshipit-source-id: c88f5289055a02dc20b7a5dcdf87469f9816d020
2020-03-05 12:42:42 -08:00
75d29f8d3e Allow converting IValue to vector<string> (#34269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34269

follow up for https://github.com/pytorch/pytorch/pull/16519

Test Plan: unit tests

Reviewed By: houseroad

Differential Revision: D20261495

fbshipit-source-id: 947f3cbd469d9258ec2dbb36cb68efe15a3b19eb
2020-03-05 12:31:23 -08:00
3a4bac5c76 Throw a proper error when parsing local variable annotations without assignments (#34133)
Summary:
Currently, putting `outputs: List[Tensor]` instead of `outputs: List[Tensor] = []` in your JITed code results in:
```
Traceback (most recent call last):
  File "custom_lstms.py", line 453, in <module>
    test_script_stacked_bidir_rnn(5, 2, 3, 7, 4)
  File "custom_lstms.py", line 404, in test_script_stacked_bidir_rnn
    rnn = script_lstm(input_size, hidden_size, num_layers, bidirectional=True)
  File "custom_lstms.py", line 62, in script_lstm
    other_layer_args=[LSTMCell, hidden_size * dirs, hidden_size]))
  File "/home/apaszke/pytorch/torch/jit/__init__.py", line 1267, in script
    return torch.jit._recursive.create_script_module(obj, torch.jit._recursive.infer_methods_to_compile)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 305, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 348, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/apaszke/pytorch/torch/jit/__init__.py", line 1612, in _construct
    init_fn(script_module)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 340, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, infer_methods_to_compile)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 348, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/apaszke/pytorch/torch/jit/__init__.py", line 1612, in _construct
    init_fn(script_module)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 340, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, infer_methods_to_compile)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 348, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/apaszke/pytorch/torch/jit/__init__.py", line 1612, in _construct
    init_fn(script_module)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 340, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, infer_methods_to_compile)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 348, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/apaszke/pytorch/torch/jit/__init__.py", line 1612, in _construct
    init_fn(script_module)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 340, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, infer_methods_to_compile)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 317, in create_script_module_impl
    stubs = stubs_fn(nn_module)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 511, in infer_methods_to_compile
    stubs.append(make_stub_from_method(nn_module, method))
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 41, in make_stub_from_method
    return make_stub(func)
  File "/home/apaszke/pytorch/torch/jit/_recursive.py", line 34, in make_stub
    ast = torch.jit.get_jit_def(func, self_name="RecursiveScriptModule")
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 173, in get_jit_def
    return build_def(ctx, py_ast.body[0], type_line, self_name)
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 206, in build_def
    build_stmts(ctx, body))
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 129, in build_stmts
    stmts = [build_stmt(ctx, s) for s in stmts]
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 129, in <listcomp>
    stmts = [build_stmt(ctx, s) for s in stmts]
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 181, in __call__
    return method(ctx, node)
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 294, in build_AnnAssign
    rhs = build_expr(ctx, stmt.value)
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 180, in __call__
    raise UnsupportedNodeError(ctx, node)
  File "/home/apaszke/pytorch/torch/jit/frontend.py", line 116, in __init__
    source_range = ctx.make_range(offending_node.lineno,
AttributeError: 'NoneType' object has no attribute 'lineno'
```

This patch makes the error message more reasonable:
```
torch.jit.frontend.UnsupportedNodeError: annotated assignments without assigned value aren't supported:
  File "custom_lstms.py", line 221
        # type: (Tensor, Tuple[Tensor, Tensor]) -> Tuple[Tensor, Tuple[Tensor, Tensor]]
        inputs = reverse(input.unbind(0))
        outputs: List[Tensor]
        ~ <--- HERE
        for i in range(len(inputs)):
            out, state = self.cell(inputs[i], state)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34133

Differential Revision: D20249076

Pulled By: ezyang

fbshipit-source-id: 40ec34ad38859f9fe56f379d3f8d08644b00fab9
2020-03-05 11:23:07 -08:00
ed11e2536a [pytorch_ci] Skip determination tests in rocm
Summary: I don't know why, but this segfaults on rocm.

Test Plan: Can only be tested on master

Reviewed By: mrshenli

Differential Revision: D20286011

fbshipit-source-id: dde952449bf54ae459d36020f3e3db6fa087b39f
2020-03-05 11:23:02 -08:00
e907128caf [ROCm] Enable BFloat16 type for pooling ops (#34166)
Summary:
This PR enables bfloat16 type for pooling ops on ROCm. Also adds bfloat16 implementation of atomicAdd since pooling ops use it.

Note: Changes in the lambda function blocks are only indentation, as they are now wrapped inside the `AT_SKIP_BFLOAT16_IF_NOT_ROCM` macro.

iotamudelta ezyang bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34166

Differential Revision: D20263421

Pulled By: ezyang

fbshipit-source-id: 3f4199ec57522e638ec29f45e22c6ec919b7816d
2020-03-05 11:20:54 -08:00
8216d9ae64 ONNX Export Support for NLLLoss (#33509)
Summary:
Adding ONNX export support for torch.nn.NLLLoss().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33509

Reviewed By: hl475

Differential Revision: D20052212

Pulled By: houseroad

fbshipit-source-id: 62efcff4efa1e0e97c65ad1b670c2fc1da08d28f
2020-03-05 11:13:21 -08:00
e642a65bea [pytorch][CI] add e2e mobile custom build jobs to CI (#34184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34184

Add mobile custom build with static dispatch & dynamic dispatch to CI.
Most of the mobile code analysis CI should be covered by the custom build +
dynamic dispatch flow, so it is changed to run on master only.

Test Plan: Imported from OSS

Differential Revision: D20241774

Pulled By: ljk53

fbshipit-source-id: f34c5748735c536ab6b42c8eb1429d8bbdaefd62
2020-03-05 10:26:45 -08:00
d98bd5e1f5 [test all] Back out "Revert D20171428: [profiler] fix chrome tracing for profiler run with cuda"
Summary:
There was an error in
https://github.com/pytorch/pytorch/pull/30724/files that resulted in
export_chrome_trace generating invalid JSON. From what it looks like, this
only came up when the profiler was run with use_cuda=True. In the future, we
should have tests that ensure we generate valid JSON, because we no longer use
the json library.
ghstack-source-id: 99508836

Test Plan: Added a unit test.

Differential Revision: D20237040

fbshipit-source-id: 510befbdf4ec39632ac56544afcddee6c8cc3aca
2020-03-05 09:05:56 -08:00
4a194f89aa Disable MNIST test in test_xla() (#34261)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34261

Test Plan: Imported from OSS

Differential Revision: D20260350

Pulled By: mrshenli

fbshipit-source-id: b92a6b79e59bdfdf8e68b5dd73f87ea1dfd0daed
2020-03-05 07:55:52 -08:00
Jie
2b79bab029 [CUDA_FUSER] Fork CUDA fuser (#33527)
Summary:
Separating CUDA fuser from CPU fuser.

1. New node in IR - prim::CudaFusionGroup:
   This enables the cuda fuser to co-exist along side the old fuser. Allows us
   to incrementally build and expand cuda fuser.

2. copied FuseGraph optimization passes to CudaFuserGraph:
   We will re-factor & reuse Chunk/Concat from the old fuser logic, which is
   handled in the optimization pass at this moment. Unfortunately, much of the
   code in the pass is tightly bound to the legacy fuser, which makes code
   sharing difficult.
   The CudaFusionGraph will support only a subset of operations compared to
   the legacy fuser (CUDA only). It is registered as a custom pass, post fusion, via
     ```torch._C._jit_register_cuda_fuser()```
   To have it in effect, you should also turn off fusion on GPU via
     ```torch._C._jit_override_can_fuse_on_gpu(False)```

3. We don't have codegen in this PR yet (WIP). Currently we just fall back to
   the old fuser.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33527

Differential Revision: D20171598

Pulled By: ZolotukhinM

fbshipit-source-id: 9a3c0f06f46da7eaa80ae7551c04869f5b03ef71
2020-03-04 20:25:08 -08:00
e132047f1b [JIT] fix alias assertion (#34268)
Summary:
[This check](019ffdca31/torch/csrc/jit/ir/alias_analysis.cpp (L772)) wasn't being triggered for None outputs of tuples, because `mustBeNone` would return false if `num_outputs != 1`.  This caused an assertion to fail in alias analysis. It's kind of a convoluted case to repro and I wasn't able to make a succinct one, but I tested internally and it fixed the bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34268

Differential Revision: D20261539

Pulled By: eellison

fbshipit-source-id: 95edea10e2971727cfd3f3bc2b6bdf9dbadca6a9
2020-03-04 19:00:58 -08:00
e2ddf935bb Run RPC JIT tests with variable type hints only in Python >=3.6 (#34284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34284

Python 3.5 only supports function type hints.
Variable type hints were introduced in Python 3.6.
So these tests with JIT type hints will fail with a SyntaxError in a Python 3.5 environment.
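
For reference, the distinction looks like this:
```python
# Accepted by Python 3.5+: annotations in the function signature.
def f(x: int) -> int:
    return x + 1

# Requires Python 3.6+ (PEP 526): a variable annotation is a
# SyntaxError in Python 3.5.
y: int = 0
```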

ghstack-source-id: 99542199

Test Plan:

Differential Revision: D7348891

fbshipit-source-id: c4c71ac021f35b5e6f7ce4d3e6af10dd1d2600cc
2020-03-04 18:59:08 -08:00
c62de4286e Add test to verify dist_autograd doesn't populate .grad field. (#33949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33949

ghstack-source-id: 99419830

Test Plan: waitforbuildbot

Differential Revision: D20165254

fbshipit-source-id: ef4413637b1568d81e4aca053838230025df6bba
2020-03-04 17:08:48 -08:00
e1c6f93f14 Clean warning message (#34143)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34143

Test Plan: Imported from OSS

Differential Revision: D20228174

Pulled By: VitalyFedyunin

fbshipit-source-id: 7ab873e87be8621b0f72e8300942fd82cbc19b29
2020-03-04 15:02:19 -08:00
1546d2afeb [pytorch_ci] Don't run determination tests in py35
Test Plan: Can only really be tested in PyTorch master

Reviewed By: mrshenli

Differential Revision: D20260023

fbshipit-source-id: b5444c376894bfccd6524cf04a71cf76eea72275
2020-03-04 14:23:40 -08:00
e236e15934 [quant] Run weight_post_process for QAT (#33852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33852

This fixes an issue for QAT models. During eval, if we call `prepare_qat` and `convert` before calling `load_state_dict`, it throws an error because the weight info (number of channels) is not updated in the observer module.
It is not an issue for the per-tensor case

Fixes issue #33830

Test Plan:
python test/test_quantization.py EagerModePostTrainingQuantTest.test_eval_after_train
python test/test_quantization.py EagerModeQuantizationAwareTrainingTest.test_eval_after_train

Imported from OSS

Differential Revision: D20212996

fbshipit-source-id: a04af8fe4df2e555270ae4d6693f5777d86f8a46
2020-03-04 14:01:32 -08:00
d59e036f4d Revert D20194092: Add support to dump unsupported ops. Add lite_interpter_load test.
Test Plan: revert-hammer

Differential Revision:
D20194092

Original commit changeset: 0d596cd02043

fbshipit-source-id: 17b4bae27543f231bd6c12d90368d399ca55ebdf
2020-03-04 13:53:58 -08:00
17a5c67796 Add support to dump unsupported ops. Add lite_interpter_load test. (#34072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34072

This diff helps check all the ops not supported by lite_interpreter.
Helpful mainly to find all the ops that need to be added instead of adding them
one by one.

Test Plan:
buck run caffe2/binaries:lite_interpreter_model_load --
--model=<bytecode-model-path>

Reviewed By: iseeyuan

Differential Revision: D20194092

fbshipit-source-id: 0d596cd0204308027194af7ed738551d0c32a374
2020-03-04 13:18:12 -08:00
385067ed4f [pytorch][cmake] improve build mobile with host toolchain (#34187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34187

Noticed that a recent PR broke Android/iOS CI but didn't break the mobile
build with the host toolchain. Turns out one mobile-related flag was not
set on the PYTORCH_BUILD_MOBILE code path:
```
"set(INTERN_DISABLE_MOBILE_INTERP ON)"
```

First, move the INTERN_DISABLE_MOBILE_INTERP macro below, to stay with
other "mobile + pytorch" options - it's not relevant to "mobile + caffe2"
so doesn't need to be set as common "mobile" option;

Second, rename PYTORCH_BUILD_MOBILE env-variable to
BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN - it's a bit verbose but
becomes more clear what it does - there is another env-variable
"BUILD_PYTORCH_MOBILE" used in scripts/build_android.sh, build_ios.sh,
which toggles between "mobile + pytorch" v.s. "mobile + caffe2";

Third, combine BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN with ANDROID/IOS
to avoid missing common mobile options again in future.

Test Plan: Imported from OSS

Differential Revision: D20251864

Pulled By: ljk53

fbshipit-source-id: dc90cc87ffd4d0bf8a78ae960c4ce33a8bb9e912
2020-03-04 11:43:16 -08:00
93990bab58 Make use of our S3 mirror if Yann Lecunn's website is not accessible (#34215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34215

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20251538

Pulled By: ezyang

fbshipit-source-id: c419f0ce869aca4dede7e37ebd274a08632d10bf
2020-03-04 11:35:34 -08:00
67608cc018 Fix MKLDNN conv2d 5d weight handling (#34115)
Summary:
Effectively backporting c5c00c119f before that PR lands

The bug didn't manifest itself earlier because the MkldnnConv2d constructor didn't reorder the weights, so the issue arose only on a second serialization/deserialization. This also fixes the constructor to deliver better perf right away.

Note, that I still serialize 5d tensor - it was the previous behavior, we have to handle it anyway and with https://github.com/pytorch/pytorch/issues/32422 the output of `mkldnn_reorder_conv2d_weight` will always be 4d.

cc pinzhenx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34115

Reviewed By: wanchaol

Differential Revision: D20224685

Pulled By: dzhulgakov

fbshipit-source-id: 24ca9227c4eb4c139096a64ae348808d7478d7dc
2020-03-04 11:26:38 -08:00
9dd5d51b01 [ATen] Exclude CUDA tests when running basic under valgrind (#34181)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34181

Test Plan: CI

Reviewed By: orionr, seemethere

Differential Revision: D20241021

fbshipit-source-id: a7371afc45acc2c07a36c8216036338e14170a56
2020-03-04 11:24:33 -08:00
8269c4f3d3 Added nullptr check for pthreadpool_get_threads_count (#34087)
Summary:
We get a segfault without this when using XNNPACK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34087

Differential Revision: D20199787

Pulled By: kimishpatel

fbshipit-source-id: d3d274e7bb197461632b21688820cd4c10dcd819
2020-03-04 11:10:53 -08:00
ac6e75a165 Revert D20195053: [pytorch][PR] Add API for listing functions overridable by __torch_function__
Test Plan: revert-hammer

Differential Revision:
D20195053

Original commit changeset: 1585f4e405f5

fbshipit-source-id: 3c1aab9c60e3138d40d200ae4238bda0cddf8896
2020-03-04 10:13:54 -08:00
78b81dad83 [Dist Autograd][Better Engineering] Enhanced Error Reporting in Dist Autograd/RPC (#34179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34179

Fixes: https://github.com/pytorch/pytorch/issues/27644

Test Plan: Asserted `test_backward_autograd_engine_error` throws an exception with node information.

Differential Revision: D20238150

fbshipit-source-id: a49b279b77416a7e0e09043aa44ed616023d8e70
2020-03-04 10:13:49 -08:00
45b8c8dbcb [torch] Fix sign-compare warning in torch::utils::rnn:pack_sequence (#34185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34185

ArrayRef<T>::size() is size_t

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D20241552

fbshipit-source-id: 73cd062db810ebc5a4e34e094dfe6c7e6571ef2d
2020-03-04 10:13:45 -08:00
39f78db7ec optimize UpSampleNearest 1d 2d and 3d performance on CPU (#31452)
Summary:
This PR aims at improving `UpSample` performance with `mode='nearest'` on 1D, 2D and 3D inputs; both inference and training are covered. The current implementation in ATen doesn't have parallelization.

1. single socket inference speedup for 1d, 2d and 3d: **63x, 57x, 46x**.
2. single core inference speedup for 1d, 2d and 3d: **5.9x, 4.6x, 3.4x**.
3. dual sockets training speedup for 1d, 2d and 3d: **38x, 33x, 65x**
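
For reference, a simple way to time the op (a sketch in the spirit of the
benchmarks above; not the exact script behind these numbers):
```python
import time
import torch

x = torch.randn(8, 32, 128, 128)
NITER = 100

s = time.time()
for _ in range(NITER):
    torch.nn.functional.interpolate(x, scale_factor=2.0, mode='nearest')
print('time/iter ms', (time.time() - s) / NITER * 1000)
```
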
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31452

Differential Revision: D20077828

Pulled By: VitalyFedyunin

fbshipit-source-id: a7815cf2ae344696067d2ec63bd4f4e858eaafff
2020-03-04 10:13:41 -08:00
112cecc440 Remove the use of macros when defining division between integers (#34104)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34104

Test Plan: Imported from OSS

Differential Revision: D20222676

Pulled By: VitalyFedyunin

fbshipit-source-id: fb026ce7843e7931324ea82542fb07784e40efdb
2020-03-04 10:13:36 -08:00
438f4ea0ac Cleaner implementation of bitwise operations of integeral types (#33849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33849

For integral types, there is no need for bit manipulation with
`reinterpret_cast`, and therefore a cleaner implementation is available.
This might also be helpful on some less-optimized compilers or on a less
optimized arch (while a test on gcc 8.3 x64 shows no difference in performance).

Test Plan: Imported from OSS

Differential Revision: D20222675

Pulled By: VitalyFedyunin

fbshipit-source-id: 875890d1479f8abab4c4a19d934fe9807d12dfd2
2020-03-04 10:13:32 -08:00
3a3fcbbc39 Use templates instead of macros when defining bitwise operators. (#33835)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33835

Test Plan: Imported from OSS

Differential Revision: D20131414

Pulled By: VitalyFedyunin

fbshipit-source-id: ec7eb7cb14e037a277cc8d71d5c9df27abf51752
2020-03-04 10:11:36 -08:00
78ad3dc174 Fix Lint (#34218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34218

Test Plan: Imported from OSS

Differential Revision: D20249788

Pulled By: mrshenli

fbshipit-source-id: 5ca2acaff5344fc4455c70af60576f8e93e54cbf
2020-03-04 09:48:57 -08:00
6f52562e75 [quant][graphmode] Add add_relu pattern in skip values (#32816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32816

att

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20208786

fbshipit-source-id: ef84b77f46f88b192a75c123aabaa203836a7dfb
2020-03-04 09:36:02 -08:00
22506ae71d Reduce code duplication in OperatorEntry by keying hash map on optional<DispatchKey> (#33817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33817

Then, nullopt denotes catch-all, whereas everything else is specific to
a DispatchKey. I can delete the second copy of methods when I do this.
This refactor should be pushed all the way to the frontend but I am doing
it one step at a time.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20125163

Pulled By: ezyang

fbshipit-source-id: 026075a4bab81b0bd88b07f0800f6e6bbeb2166a
2020-03-04 08:57:22 -08:00
c688eb28a2 Minor fix for quantizing the Ads complex model
Summary:
Remove Int8Relu in quantized model
Suppress log warnings if verbose is false

Test Plan: TBD

Reviewed By: yinghai

Differential Revision: D20202474

fbshipit-source-id: 995ef8e665d8edeee810eedac831440b55271a7b
2020-03-04 08:34:59 -08:00
5f4a01b2ea Update MAGMA to 2.5.2 for Windows (#34205)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34205

Differential Revision: D20248224

Pulled By: soumith

fbshipit-source-id: f5e0fe06aa8f8ee551abe45db1d55d06e95ab928
2020-03-04 08:28:09 -08:00
f6c883ccea TH: Defer to ATen's AVX detection code (#34088)
Summary:
As per https://github.com/pytorch/pytorch/issues/22338#issuecomment-593028168, this removes the AVX detection code from TH. Now the environment variable `ATEN_CPU_CAPABILITY` is the only setting needed to disable AVX/AVX2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34088

Differential Revision: D20236039

Pulled By: ezyang

fbshipit-source-id: eecec64b41a7a6ca7e42c1c2762032eb47af535c
2020-03-04 08:22:02 -08:00
fdd771c90f Make tracing in code gen optional (#33715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33715

Tracing code depends on the full JIT, which is not available in the lite interpreter. Use `-c pt.disable_gen_tracing=1` to turn off generating the tracing part.
ghstack-source-id: 99252322

Test Plan:
```
buck build xplat/caffe2:torch -c pt.disable_gen_tracing=1
```
The tracing part of generated/VariableType_?.cpp will not be generated.

Reviewed By: smessmer

Differential Revision: D19684577

fbshipit-source-id: a1e5b80eca5e51c7bf72b5cc8f0e36c2135fabc2
2020-03-04 08:16:31 -08:00
790274bff2 [caffe2] Fix signed unsigned comparison warning (#34161)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34161

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D20232087

fbshipit-source-id: 09dc8d452c5923cd2941e0cc01eac7a6677b38e8
2020-03-04 08:02:44 -08:00
6d78882158 Add layout.html to template for stable docs (#33770)
Summary:
When the docs are built, conf.py points to a _templates-stable/layout.html that does not exist.
Adding this file here so future stable docs will build with Google Analytics tags and without the unstable banner that is in _templates/layout.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33770

Differential Revision: D20164895

Pulled By: jlin27

fbshipit-source-id: 5fca9f9b825b1484dab52e2b2d91f92ae6372371
2020-03-04 03:14:52 -08:00
fc6dce6033 [c10] Fix TORCH_INTERNAL_ASSERT_DEBUG_ONLY MSVC bug (#34173)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34173

Test Plan:
Temporarily change `AT_ASSERTM` to `TORCH_INTERNAL_ASSERT_DEBUG_ONLY` to test MSVC fix.

```
buck test mode/opt //caffe2/caffe2:caffe2_test_cpu -- 'BlobTest'
```

& CI

Reviewed By: yinghai

Differential Revision: D20235886

fbshipit-source-id: 2b7d618e924a0ede95f4a6b8f60cc08e9d58b09d
2020-03-04 02:45:35 -08:00
f097ca503d Add and test training in lite interpreter. (#32359)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32359

Test Plan: Imported from OSS

Differential Revision: D19450614

Pulled By: iseeyuan

fbshipit-source-id: 6bafff39d7880a5b7fb9cd70c33a4e584812be12
2020-03-03 23:33:43 -08:00
2ba74b741e Add backward Int8Quantize shape inference (#34152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34152

Propagate the input shape of Int8Quantize backwards.

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: csummersea

Differential Revision: D20231521

fbshipit-source-id: a77c61b0d5bc570241e62553cecd9ff38553ff44
2020-03-03 22:04:25 -08:00
57c1b80ec2 [pytorch]Migrate _th_ger to Aten and kill resize_scalar in codegen (#33792)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33792

Test Plan: Imported from OSS

Differential Revision: D20107158

Pulled By: glaringlee

fbshipit-source-id: bceddb2d39d3abf36f277daba537677312449c9c
2020-03-03 20:27:54 -08:00
7d01888a75 [JIT] Register rpc.rpc_async(..) as a JIT operator (#33329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33329

# Use case

```
torch.jit.script
def send_rpc_async(dst_worker_name, user_callable_qual_name, tensor):
    # type: (str, str, Tensor) -> None
    rpc._rpc_async_torchscript(
        dst_worker_name, user_callable_qual_name, args=(tensor,)
    )
```

# Problem

```
torch.jit.frontend.NotSupportedError: keyword-arg expansion is not supported:
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/rpc_spawn#binary,link-tree/torch/distributed/rpc/api.py", line 722
    args = args if args else ()
    kwargs = kwargs if kwargs else {}
    fut = _invoke_rpc_torchscript(to, qualified_name, *args, **kwargs)
                                                               ~~~~~~ <--- HERE
    return fut
```

# Solution

Register `rpc.rpc_async(..)` as a JIT operator to handle variable-length argument list.

# Plan

This PR is the required changes to make `rpc.rpc_async(..)` a JIT prim operator, which can dynamically handle different number of arguments.

- Register "prim::rpc_async" as a `Symbol` in "interned_string.h"
- Add a if branch in "python_sugared_value.cpp" `toSugarValue(py::object, ..)` entry utility function to set up how JIT frontend convert `torch.distributed.rpc.rpc_async(..)` Python function (Python object) into a `SpecialFormValue` (IR SugaredValue).
- Add a switch case for "prim::rpc_aynsc" Symbol in "ir_emitter.cpp" and `emitApplySpecialForm(..)` to set up how JIT compiler provides inputs to the "prim::rpc_aynsc" Operator.
- Register "prim::rpc_async" as a `jit::Operator` and provide implementation in "register_distributed_ops.cpp".

Notice: since the distributed module is an optional part when building PyTorch, the code to be added in this PR should be wrapped within a preprocessor macro.
```
#ifdef USE_DISTRIBUTED
new code here
#endif
```

Test Plan:
Items that need to be confirmed in the test cases

https://fb.quip.com/DCvdA9ZLjeO0

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork  \
\
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_call_python_function_remotely_from_script_not_supported
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:layer_norm_op_test-2.7 -- test_layer_norm_op_jit
```

Differential Revision: D5738300

fbshipit-source-id: a4604fe762e00be062dc8232ca9790df31fb2074
2020-03-03 19:57:42 -08:00
9b39ad7f2c [jit] Fix iOS build (#34180)
Summary:
`unpickler.cpp` depends on the mobile type parser all the time, so include it regardless of whether it's a mobile build or not
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34180

Pulled By: driazati

Differential Revision: D20241881

fbshipit-source-id: a998dd2b3f1c7f58e55bb7851dc595c8ddf9eacb
2020-03-03 19:44:43 -08:00
3c042a6ab9 [pytorch][mobile] support for custom mobile build with dynamic dispatch (#34055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34055

Enable custom mobile build with dynamic dispatch for OSS build.

It calls a python util script to calculate transitive dependencies from
the op dependency graph and the list of used root ops, then passes the
result as the op registration whitelist to aten codegen, so that only
these used ops are registered and kept at link time.

For custom build with dynamic dispatch to work correctly, it's critical
to have the accurate list of used ops. Current assumption is that only
those ops referenced by TorchScript model are used. It works well if
client code doesn't call libtorch API (e.g.  tensor methods) directly;
otherwise the extra used ops need to be added to the whitelist manually,
as shown by the HACK in prepare_model.py.

Also, if JIT starts calling extra ops independent of specific model,
then the extra ops need to be added to the whitelist as well.
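
As a rough illustration (my sketch, not part of this PR; the model path is hypothetical), the list of root ops referenced by a TorchScript model - the input to the transitive-dependency calculation above - can be dumped with `torch.jit.export_opnames`, assuming that helper is available in the build:

```
import torch

# Hypothetical model file; any TorchScript model works the same way.
scripted = torch.jit.load("mobilenet_v2.pt")
root_ops = torch.jit.export_opnames(scripted)
print(root_ops)  # e.g. ['aten::_convolution', 'aten::relu_', ...]
```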

Verified the correctness of the whole process with MobileNetV2:
```
TEST_CUSTOM_BUILD_DYNAMIC=1 test/mobile/custom_build/build.sh
```

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D20193327

Pulled By: ljk53

fbshipit-source-id: 9d369b8864856b098342aea79e0ac8eec04149aa
2020-03-03 19:25:16 -08:00
e5bbd23ca7 [quant][graphmode] Skip quantizing input and output in matched module (#32814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32814

We skip quantization for the intermediate values for patterns like `Conv - ReLU`,
but currently we don't skip quantizing the input/output of the graphs of matched modules;
since we changed the way we add observers, this also needs to be updated.

Test Plan:
python test/test_jit.py -- 'TestJit.test_insert_observers_skip_values'

Imported from OSS

Differential Revision: D20208785

fbshipit-source-id: ce30f2c4c8ce737500d0b41357c80ec8b33aecf9
2020-03-03 18:38:36 -08:00
7cee787a19 [pytorch_ci] Python target determinator (#33577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33221

This will make it so that if a pull request touches only Python files, then we'll run only the Python tests that are connected to the dependency graph of the touched files.

Assumptions made:
- the Python code does not do dynamic imports
- test_X.py never imports from test_Y.py

Right now this is only done for test_nn (presumably the largest test entrypoint), but it's not much more work to do it for all the other test entrypoints too.
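
As a rough sketch of the idea (my example using only the standard library, not the determinator's actual code):

```
import ast

def imported_modules(path):
    # Collect a test file's statically declared imports; a test is re-run
    # only if a touched module is reachable from it. The real determinator
    # also walks the dependency graph transitively.
    with open(path) as f:
        tree = ast.parse(f.read())
    mods = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            mods.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module)
    return mods
```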

Test Plan:
CircleCI results when touching just a few Python files:
- pytorch_macos_10_13_py3_test: 41 ->13 minutes https://circleci.com/gh/pytorch/pytorch/4550574?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
- pytorch_windows_vs2019_py36_cuda10.1_test1: 11 -> 2 minutes https://circleci.com/gh/pytorch/pytorch/4550846?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
- pytorch_windows_vs2019_py36_cuda10.1_test2: 51 -> 21 minutes https://circleci.com/gh/pytorch/pytorch/4550845?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
- pytorch_linux_xenial_py3_6_gcc5_4_test: 41 -> 14 minutes https://circleci.com/gh/pytorch/pytorch/4550543?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Differential Revision: D20009089

fbshipit-source-id: 41708cc301d1c866eb92a04421d8346feb0e3cb5
2020-03-03 18:01:12 -08:00
7c20578794 NNPI op mapping correct SpatialBN NNPI op name (#34176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34176

Wrong operator name for the NNPI SpatialBN

Test Plan: flow canary

Reviewed By: hyuen

Differential Revision: D20237933

fbshipit-source-id: dfde658dcbf2482320e36d549f7d83c27df264a0
2020-03-03 17:57:28 -08:00
a19db54b36 [Redo][ATen] Remove AT_ASSERTM from Blob::free_() (#34168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34168

Redo D19153199. It was reverted because it broke CI, due to the change of `AT_ASSERTM` to `TORCH_INTERNAL_ASSERT_DEBUG_ONLY`. Two problems:
1) bug in `TORCH_INTERNAL_ASSERT_DEBUG_ONLY` about MSVC. I'm sending another diff to fix this bug.
2) BlobTest was expecting `Blob::template Get<T>()` to throw when there is a type mismatch.

For now I'll leave `AT_ASSERTM` as it is.

Test Plan:
```
buck test mode/dev //caffe2/caffe2:caffe2_test_cpu -- 'BlobTest' --run-disabled
buck test mode/opt //caffe2/caffe2:caffe2_test_cpu -- 'BlobTest' --run-disabled
```

Reviewed By: yinghai

Differential Revision: D20235225

fbshipit-source-id: 594dad97c03c419afaa8f9023408bc5a119b3cfa
2020-03-03 17:54:05 -08:00
31cc311143 Expose CUDACachingAllocator raw_alloc and raw_delete to python (#33860)
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).

Instead of having two separate and conflicting memory pools, with this PR CuPy can directly allocate memory from the PyTorch allocator by means of this proposal: https://github.com/cupy/cupy/pull/3126

We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860

Differential Revision: D20212788

Pulled By: ngimel

fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
2020-03-03 17:50:11 -08:00
4edff32f81 [c10] Fix typo in __assert_fail noreturn modifier guard (#34157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34157

`[[noreturn]]` only conflicts with the CUDA `__assert_fail` definition if clang is used as the host compiler

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D20232088

fbshipit-source-id: 7182c28a15278e03175865cd0c87410c5de5bf2c
2020-03-03 17:25:25 -08:00
99e211e661 [jit] Add type tags to lists/dicts in pickle (#33255)
Summary:
Stacked PRs
 * #33474 - [jit] Remove list specializations from pickler
 * **#33255 - [jit] Add type tags to lists/dicts in pickle**

This adds a global call to `torch.jit._pickle.restore_type_tags` for
lists and dicts so that we can preserve their types after serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33255

Pulled By: driazati

Reviewed By: xman1979, Tianshu-Bao

Differential Revision: D19868637

fbshipit-source-id: 2f1826e6679a786ca209198690269f399a542c04
2020-03-03 16:48:21 -08:00
7da24b36b1 Apply clang-format to RPC files (#34139)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34139

Test Plan: Imported from OSS

Differential Revision: D20227342

Pulled By: mrshenli

fbshipit-source-id: 01b478bde1f6a51f69eb5277fa90ba6ac2d4b5dc
2020-03-03 16:44:35 -08:00
3af0dffe84 Use double quotes in C++ to stay consistent with Python RPC docs (#34095)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34095

Test Plan: Imported from OSS

Differential Revision: D20227343

Pulled By: mrshenli

fbshipit-source-id: 69c556beee1f9e944eb1053b5ff0ac368dd99c60
2020-03-03 16:44:30 -08:00
f1085a8e41 Improve ProcessGroup RpcBackendOptions Constructor API (#34081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34081

Before this commit, applications had to do the following to configure the
number of threads in the ProcessGroup RPC backend:

```
op = ProcessGroupRpcBackendOptions()
op.rpc_timeout = rpc_timeout
op.init_method = init_method
op.num_send_recv_threads = 32
init_rpc(...., rpc_backend_options=op)
```

After this commit, it can be simplified to:

```
init_rpc(...., rpc_backend_options=ProcessGroupRpcBackendOptions(num_send_recv_threads=32))
```

Fixes #34075

Test Plan: Imported from OSS

Differential Revision: D20227344

Pulled By: mrshenli

fbshipit-source-id: def4318e987179b8c8ecca44d7ff935702c8a6e7
2020-03-03 16:43:29 -08:00
9d1c971b11 [Aten] Suppress valgrind leaks in libcuda (#34169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34169

Valgrind has no insight into how memory is initialized by ioctls()

Test Plan: CI

Reviewed By: seemethere

Differential Revision: D20235974

fbshipit-source-id: 46413afa4842e7d42582bbbda903438b1d98691f
2020-03-03 16:00:17 -08:00
1beb309e03 Make DEBUG == REL_WITH_DEB_INFO on CUDA build (#34153)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/34079

I don't know how much we care about the difference between `-G` and `-lineinfo` in `DEBUG` vs `REL_WITH_DEB_INFO`, but since `-G` never worked, let's just use `-lineinfo` on both `DEBUG` and `REL_WITH_DEB_INFO`. This would resolve the failure in `DEBUG=1` build. Locally tested to work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34153

Reviewed By: ljk53

Differential Revision: D20232049

Pulled By: ngimel

fbshipit-source-id: 4e48ff818850ba911298b0cc159522f33a305aaa
2020-03-03 15:07:42 -08:00
cb3905e8cf .circleci: Re-do run nightly pipelines on tag (#34148)
Summary:
The commit that this commit relied on was found to be causing issues with
valgrind: https://github.com/pytorch/pytorch/issues/33471

Re-does https://github.com/pytorch/pytorch/issues/34078 after revert.

This reverts commit 1aff3e2dd3c3937aa1fedbfeee2143cfca25abcc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34148

Differential Revision: D20234451

Pulled By: seemethere

fbshipit-source-id: cb5e496a3f761beeeb0cc8df71f9ebc0b271737b
2020-03-03 15:00:59 -08:00
7cda964e20 Remove deprecated codepath for old-style autograd.Function (#30696) (#33956)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33956

Test Plan: Imported from OSS

Differential Revision: D20167359

Pulled By: glaringlee

fbshipit-source-id: 9b323bd29eca97bce0475225ad2b3b2ded29005d
2020-03-03 14:58:02 -08:00
04378eb618 [JIT] Add modulelist indexing for integer literal (#29236)
Summary:
Allow indexing into ModuleLists with integer literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29236

Differential Revision: D19583935

Pulled By: eellison

fbshipit-source-id: 24d54051422a69769dac5e82f3bf622ded2bd8a6
2020-03-03 14:47:31 -08:00
ba1bd41767 Turn on strict dtype checking for test_torch.py (#33825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33825

Partially addresses #20376

I do this by overriding assertEqual in classes that opt into
this.  This means I have to fix #33821.  The fix is a little
unsatisfactory as idiomatic Python 2 super() calls don't work
(since the class is no longer in scope); hopefully this will just
work when we go to Python 3.

General approach taken:
- A lot of dtype mismatches are because we specified tensor constants
  that infer to some dtype, but the actual dtype needed is something else.
  Those are easy, just annotate the tensor() constructor (often a legacy
  Tensor/FloatTensor call) with dtype
- There are a few cases where the promotion rules are nontrivial.  Some of them
  I just typed out the expected promotion rules manually (based on trial
  and error)
- There are some more complex cases; if it gets too hairy I just
  set exact_dtype=False and nope the fuck out

I don't have time to do it for all the other classes.  But the setup
should work if people just incrementally add the overrides to classes,
and then eventually flip the default.
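
As a minimal illustration of the first bullet (my example, not from this diff): tensor constants infer a dtype from their values, so tests that need a different dtype have to say so explicitly.

```
import torch

t = torch.tensor([1, 2, 3])                       # infers torch.int64
u = torch.tensor([1, 2, 3], dtype=torch.float32)  # explicit dtype annotation
assert t.dtype == torch.int64 and u.dtype == torch.float32
```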

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20125791

Pulled By: ezyang

fbshipit-source-id: 389c2d1efbd93172af02f13e38ac5e92fe730c57
2020-03-03 14:45:53 -08:00
c579976603 Revert D20171428: [profiler] fix chrome tracing for profiler run with cuda
Test Plan: revert-hammer

Differential Revision:
D20171428

Original commit changeset: ec135a154ce3

fbshipit-source-id: 51ef4351a0df33fd087edbca1b7cd753cdbf1fdf
2020-03-03 14:36:01 -08:00
f299c2d6e1 Completely kill CUDA_tensor_apply3 (#34026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34026

Test Plan: Imported from OSS

Differential Revision: D20196078

Pulled By: VitalyFedyunin

fbshipit-source-id: 502184f412edee90a4f4c030def277a99a7369d4
2020-03-03 14:18:17 -08:00
1affaf8d10 Migrate lerp from CUDA_tensor_apply3 to TensorIterator (#34025)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34025

Test Plan: Imported from OSS

Differential Revision: D20196079

Pulled By: VitalyFedyunin

fbshipit-source-id: 150d1de6632c58850020b73ee72e0ed380072926
2020-03-03 14:18:12 -08:00
27f56632a4 Migrate bce loss from CUDA_tensor_apply3 to TensorIterator (#34023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34023

Test Plan: Imported from OSS

Differential Revision: D20196084

Pulled By: VitalyFedyunin

fbshipit-source-id: bd000f09139cb848562e5310f10067db85e1b935
2020-03-03 14:16:40 -08:00
92083f31b5 [gloo] dont hold locks in calls to buffer in ProcessGroupGloo:RecvWork::wait() and (#33926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33926

The UnboundBuffer calls here are already protected by a mutex. We only
need to hold the lock while writing the shared structures completed_ and
exception_.
ghstack-source-id: 99315427

Test Plan:
CI

Differential Revision: D20154546

fbshipit-source-id: d1b74508c917b21acdcd0f6a914eb0455437ca0e
2020-03-03 13:28:45 -08:00
c93b1d427c [profiler] fix chrome tracing for profiler run with cuda (#33987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33987

There was an error in
https://github.com/pytorch/pytorch/pull/30724/files that resulted in
`export_chrome_trace` generating invalid JSON. This only came up when the
profiler is run with `use_cuda=True` from what it looks like. In the future, we
should have tests that ensure we generate valid JSON because we no longer use
the json library.
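
A minimal sketch of such a test (my example; the exact profiler invocation in the actual UT may differ):

```
import json
import torch

with torch.autograd.profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    x = torch.randn(10, 10)
    x.matmul(x)

prof.export_chrome_trace("trace.json")
with open("trace.json") as f:
    json.load(f)  # raises if the exported trace is not valid JSON
```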

Test Plan: Add UT to validate JSON.

Differential Revision: D20171428

fbshipit-source-id: ec135a154ce33f62b78d98468174dce4cf01fedf
2020-03-03 13:27:26 -08:00
6a97777f72 Remove use of .data from optimizers (#33640)
Summary:
Removes all uses of `.data` from optimizers.

Or tries to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33640

Reviewed By: vincentqb

Differential Revision: D20203216

Pulled By: albanD

fbshipit-source-id: 9bfe78bbed00fd4aaa690801cff0201f0bd680a0
2020-03-03 13:21:55 -08:00
f26bbb5f86 [fix] flake8 lint error (#34146)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34146

Test Plan:
.

Imported from OSS

Differential Revision: D20228830

fbshipit-source-id: 41de3c27c10256939ae6309d25b0499f708a3dca
2020-03-03 13:15:27 -08:00
a8fc3d8c2a Fix HistogramObserver to not do detach on input (#34114)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33545, added a unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34114

Differential Revision: D20224719

Pulled By: dzhulgakov

fbshipit-source-id: 053d3b3b0c86340027ba1b95b5f3c247aa151aee
2020-03-03 13:15:22 -08:00
9650253d70 [caffe2] fix ambiguous call to 'fmaxType' THCHalfAutoNumerics.cuh (#33569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33569

Clang reported a few places where a call to `fmaxType` is ambiguous. In all cases one of the arguments is `double` and the other is `float`. Fix the error by creating a properly typed zero value and remove the unneeded `ZERO_MACRO` code.

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```

Reviewed By: ngimel

Differential Revision: D20006926

fbshipit-source-id: ca6cfacd57459b1c48eb5080b822d9509b03544d
2020-03-03 13:13:19 -08:00
49586a2a7e fix sph batchnorm to use sph fma
Summary: Make use of Springhill's FMA in SpatialBatchNorm.

Test Plan:
re-enabled the unit test, ran it a couple of times
pending: net runner

Reviewed By: amylittleyang

Differential Revision: D20227767

fbshipit-source-id: 7c601f185940249c0a32bdf95d74a20552cd2625
2020-03-03 12:53:08 -08:00
49921cad28 Minimum build should also exclude XNNPACK (#34110)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34110

Differential Revision: D20228129

Pulled By: ezyang

fbshipit-source-id: 24e1482f6a6ff423de966bb7a7a45ad3815791e9
2020-03-03 12:51:37 -08:00
fbc9c61c81 randn and normal_ for complex tensors (#34037)
Summary:
1. randn and normal_ methods will work for complex tensors after this PR.
2. Added an internal function for viewing complex tensors as float tensors, which enables us to reuse functions defined for float tensors for complex tensors, with a change in the arguments passed (like size, or standard deviation in the case of normal_). Currently the resultant new float tensor doesn't share storage with the input complex tensor, which means the version counter wouldn't be updated if any function is called on this resultant tensor, but once the dtype entry is removed from the storage class, this issue will be resolved.

Side notes:
1. Didn't add a separate header for the util functions because of this issue: https://github.com/pytorch/pytorch/issues/20686#issuecomment-593002293
2. We should eventually have a public API method view_complex_as_float once (2) mentioned above gets resolved.
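
For illustration (my own sketch, assuming the complex dtypes exposed at this point), after this PR something like the following should work:

```
import torch

z = torch.randn(2, 3, dtype=torch.complex64)  # randn now supports complex dtypes
z.normal_(mean=0.0, std=2.0)                  # in-place normal_ on a complex tensor
print(z.dtype)  # torch.complex64
```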
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34037

Differential Revision: D20221793

Pulled By: anjali411

fbshipit-source-id: a78f5e83d6104e2f55e0b250c4ec32e8d29a14eb
2020-03-03 12:46:01 -08:00
ad2825a2c9 Add API for listing functions overridable by __torch_function__ (#33791)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33182

This adds private API functions that developers of types that implement `__torch_function__` can use to ensure full coverage of the subset of the PyTorch API that can be overrided.

I've refactored some of the code in the tests into a new `torch._overrides.get_overridable_functions` function. I've also changed `TENSOR_LIKE_TORCH_OVERRIDES` into `torch._overrides.get_testing_overrides` and `IGNORED_TORCH_FUNCTIONS` into `torch._overrides.get_ignored_functions`. Making these two static global variables in the tests into functions should allow rewriting their implementation to construct their return values instead of just statically defining the return value as is done here. Currently that is blocked on not being able to inspect function signatures of compiled kernels in PyTorch (see https://github.com/pytorch/pytorch/issues/28233). See the docs I've added for usage examples of these new functions. I also refactored the existing override tests to make use of these new functions, which should be a good forcing function to make sure they're kept up-to-date.
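
A minimal sketch (my example) of consuming these private functions from a `__torch_function__` implementer's test suite:

```
import torch

overridable = torch._overrides.get_overridable_functions()  # namespace -> [functions]
dummies = torch._overrides.get_testing_overrides()          # function -> dummy override
ignored = torch._overrides.get_ignored_functions()          # functions that can't be overridden

print(len(dummies), "overridable functions have testing stubs")
```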

Finally, while working on this I discovered that `TestTorchFunctionOverrides.test_mean` and `TestTorchFunctionOverrides.test_mm` weren't ever being run because they were getting clobbered by the other dynamically generated override tests. I fixed that by renaming the tests and then fixing the actual test code. I've verified that all the subclassing semantics is correct and that the updated test answers are correct. I'm happy to put the fixes to the existing tests in as a separate pull request if that would be easier to review.

ping cpuhrsch since the feature request originally came from them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33791

Differential Revision: D20195053

Pulled By: cpuhrsch

fbshipit-source-id: 1585f4e405f5223932b410eae03a288dc8eb627e
2020-03-03 12:40:34 -08:00
358450e02b improved TorchScript traceback (#33834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33834

This changes how we report Tracebacks to make them more clear when
there are both serialized and non-serialized ranges. It now looks like:

```
Traceback (most recent call last):
  File "foo.py", line 25, in <module>
    s2(a, b)
  File "/scratch/zdevito/pytorch/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 7, in forward
    x: Tensor,
    y: Tensor) -> Tensor:
    return (self).bar(x, y, )
            ~~~~~~~~~ <--- HERE
  def bar(self: __torch__.Moo,
    x: Tensor,
  File "code/__torch__.py", line 11, in bar
    x: Tensor,
    y: Tensor) -> Tensor:
    _0 = (self).baz(x, y, )
          ~~~~~~~~~ <--- HERE
    _1 = torch.ones([3], dtype=None, layout=None, device=None, pin_memory=None)
    return torch.add(_0, _1, alpha=1)
  File "code/__torch__.py", line 17, in baz
    x: Tensor,
    y: Tensor) -> Tensor:
    return torch.add(x, y, alpha=1)
           ~~~~~~~~~ <--- HERE

Traceback of TorchScript, original code (most recent call last):
  File "foo.py", line 11, in forward
    def forward(self, x, y):
        return self.bar(x, y)
               ~~~~~~~~ <--- HERE
  File "foo.py", line 9, in bar
    def bar(self, x, y):
        return self.baz(x, y) + torch.ones(3)
               ~~~~~~~~ <--- HERE
  File "foo.py", line 7, in baz
    def baz(self, x, y):
        return x + y
               ~~~~~ <--- HERE
RuntimeError: The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 1
```

It follows the Python convention of putting the most important information last
and reading from the bottom up.

Changes:
* Moved the error message to the end, to copy Python
* Report original traceback separate from serialized traceback
* Make sure root functions have names in the interpreter trace.

Test Plan: Imported from OSS

Differential Revision: D20126136

Pulled By: zdevito

fbshipit-source-id: fd01f9985e5d74e04c4d064c02e8bc320f4fac13
2020-03-03 12:27:38 -08:00
74a0663afd In torch_test, mark every test that takes >5s on a DEBUG CPU-only build as slow test (#33901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33901

After this change, the pytest profile looks like:

4.83s call     test/test_torch.py::TestTorch::test_fft_ifft_rfft_irfft
4.23s call     test/test_torch.py::TestTorch::test_var_dim
4.22s call     test/test_torch.py::TestTorch::test_std_dim
4.19s call     test/test_torch.py::TestTorch::test_max
4.06s call     test/test_torch.py::TestTorch::test_min
3.60s call     test/test_torch.py::TestTorchDeviceTypeCPU::test_cdist_norm_batch_cpu
2.62s call     test/test_torch.py::TestTorchDeviceTypeCPU::test_pow_cpu
2.60s call     test/test_torch.py::TestTorch::test_matmul_small_brute_force_1d_Nd

And the entire CPU-only test suite can be run in 88s on my Intel(R) Xeon(R) CPU
E5-2650 v4 @ 2.20GHz

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20222288

Pulled By: ezyang

fbshipit-source-id: 4224a9117f42566e290ae202881d76f1545cebec
2020-03-03 11:49:49 -08:00
9b527b35bb CUDA Vectorized Dropout (#33879)
Summary:
Add vectorization to dropout kernels for both reads & writes. Moved the `masked_scale_kernel` implementation to `TensorIterator` to pick up recent autovectorization additions by zasdfgbnm, and wrote a vectorized specialization of the dropout training kernel (along with some fairly conservative dispatch logic).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33879

Differential Revision: D20222853

Pulled By: ngimel

fbshipit-source-id: 711f56ca907fbc792a10d4bf069c28adab7d6ad7
2020-03-03 11:43:45 -08:00
0cf34cf672 [pytorch][mobile] make sure mobile build work with dynamic dispatch (#34038)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34038

Mobile build doesn't include autograd/VariableType dispatch. As a
result, AutoNonVariableTypeMode needs to be set in the mobile runtime.

With static dispatch this work is done inside the generated jit-dispatch
code - AutoNonVariableTypeMode needs to be set on a per-op basis. Setting
it globally or setting it for the wrong ops might break some `is_variable()`
checks in the codebase.

Thanks to the unification of Variable class and Tensor class, all
is_variable() checks have been removed, so AutoNonVariableTypeMode can
be set globally now.

We never tested inference-only mobile build with dynamic dispatch. It
seems that dynamic dispatch also requires setting AutoNonVariableTypeMode
for our mobile build (where VariableType functions are not registered).

Verified the end-to-end test works with this change:
```
TEST_CUSTOM_BUILD_DYNAMIC=1 test/mobile/custom_build/build.sh
```

Test Plan: Imported from OSS

Differential Revision: D20193329

Pulled By: ljk53

fbshipit-source-id: cc98414d89d12463dc82b0cdde0b6160dafc0349
2020-03-03 11:34:08 -08:00
51936c5ea4 [pytorch][CI] end-to-end custom build script (#34012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34012

Today some mobile simulator tests only run on landed PRs, and reproducing
errors locally requires setting up a special build environment.

The goal of the PR is to do end-to-end mobile custom build & integration
tests with host toolchain (using same CMake options as mobile build). This
way, non-mobile engineers can capture & debug mobile related build issues
much more easily.

There are three custom build types that this script supports:

1. `TEST_DEFAULT_BUILD=1 ./build.sh` - it is similar to the prebuilt libtorch
libraries released for Android and iOS (same CMake build options + host
toolchain), which doesn't contain autograd function nor backward ops thus is
smaller than full LibTorch.

2. `TEST_CUSTOM_BUILD_STATIC=1 ./build.sh` - it further optimizes libtorch
size by only including ops used by a specific model.

3. `TEST_CUSTOM_BUILD_DYNAMIC=1 ./build.sh` - similar as 2) except that it
relies on the op dependency graph (instead of static dispatch) to calculate
and keep all transitively dependent ops by the model.

Type 2) will be deprecated by type 3) in the future.
Type 3) custom build has not been fully supported yet so it's expected to fail.

Replacing existing mobile build CI to run Type 1) build & integration test.

Test Plan: Imported from OSS

Differential Revision: D20193328

Pulled By: ljk53

fbshipit-source-id: 48c14cae849fde86e27123f00f9911996c1cf40e
2020-03-03 10:55:17 -08:00
5b9f1ada30 [quant][graphmode] Observing input/output values in call site (#33277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33277

Currently we insert observers in the called graph, which is incorrect since graphs can be shared,
and the decision of whether to insert an observer or not might depend on where the graph is called.
For example, for a call sequence `self.conv1(self.conv2(x))`, we can't insert observers correctly
if `self.conv1` and `self.conv2` share the same type in the current implementation, because right now
we insert observers in the graph of the forward method of Conv2d, and this call sequence requires us to insert
only one observer for the output of self.conv2/input of self.conv1.
We'll need to insert observers for input/output values of the graph at the call site instead.
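
For concreteness, a minimal sketch (my example, not from this diff) of the sharing situation described above:

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # conv1 and conv2 share the Conv2d type, and hence one forward graph,
        # so an observer inserted inside that graph cannot be specific to one call.
        self.conv1 = torch.nn.Conv2d(3, 3, 1)
        self.conv2 = torch.nn.Conv2d(3, 3, 1)

    def forward(self, x):
        return self.conv1(self.conv2(x))

m = torch.jit.script(M())
```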

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20208787

fbshipit-source-id: 739e1d877639c0d0ed24e573bbd36211defa6836
2020-03-03 10:53:24 -08:00
7289e8e865 [caffe2] std::numeric_limits<double>::quiet_NaN() use instead of ::nan("") (#33566)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33566

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```

Reviewed By: ngimel

Differential Revision: D20006447

fbshipit-source-id: ec522bc2065ad033ee2eeedd26d4a8a7a27e5f56
2020-03-03 10:42:58 -08:00
1702152ef9 fixup unit tests (#34105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34105

Make parallel_net_test.cc chronos-conforming.
Exclude gtest asserts that check thrown exceptions when exceptions are disabled.

Test Plan: CI green

Differential Revision: D20153525

fbshipit-source-id: 7371e559da948f46773fed09e3a23a77411d59e0
2020-03-03 10:33:21 -08:00
5082839de5 Migrate Lerp from CUDA_tensor_apply4 to TensorIterator (#33994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33994

Test Plan: Imported from OSS

Differential Revision: D20196788

Pulled By: VitalyFedyunin

fbshipit-source-id: e5e281460e8cca7ea3911fe56549e1ab62d50e76
2020-03-03 09:38:49 -08:00
4074d559e4 Migrate kl_div_backward from CUDA_tensor_apply3 to TensorIterator (#34022)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34022

Test Plan: Imported from OSS

Differential Revision: D20196080

Pulled By: VitalyFedyunin

fbshipit-source-id: 265884dc01c3260197776ee5baaadbe6b523fede
2020-03-03 09:33:31 -08:00
3def76583a [RESUBMIT] [pytorch] Migrating index_add cuda to ATen (#33548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33548

Mostly just moved code.
Index dim and number of indices checks are added to make checks idential to index_add_cpu_

This is a resubmit of #30573, which got reverted.

Test Plan: Imported from OSS

Differential Revision: D20002248

Pulled By: gchanan

fbshipit-source-id: 46df4047cb3fc1dff37a15b83c70b2cbb7a6460b
2020-03-03 09:06:13 -08:00
f29110fdf8 [pytorch] blas gemm fix for k=0 (#33819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33819

These conditions are specific to one implementation; the fallback implementation works without these checks, so use the fallback if any of the checks isn't true.

Resubmit of https://github.com/pytorch/pytorch/pull/33419 (which got reverted due to a problem with XLA, but which now has been fixed)
ghstack-source-id: 99333280

Test Plan: Test included

Differential Revision: D20121460

fbshipit-source-id: c1056b8e26751e24078bbe80c7cb4b223bcca7cb
2020-03-03 08:56:05 -08:00
b1fd7ba019 Revert D20169501: [pytorch][PR] .circleci: Add CUDA 10.2 to our CI pipeline
Test Plan: revert-hammer

Differential Revision:
D20169501

Original commit changeset: 43b7ca680200

fbshipit-source-id: dbeb0315ccc06b8e082d019cd1ffcd97e1d38e04
2020-03-03 08:15:36 -08:00
1aff3e2dd3 Revert D20204104: [pytorch][PR] .circleci: Add filter to run nightly builds on tag
Test Plan: revert-hammer

Differential Revision:
D20204104

Original commit changeset: 685630e8a04b

fbshipit-source-id: 1f4c890b0b199b406bac51e30febb8c6482e7e31
2020-03-03 08:03:03 -08:00
cyy
5be8a4e027 find mkl installed by nuget (#34031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34031

Differential Revision: D20221807

Pulled By: ezyang

fbshipit-source-id: 827e2775956f408febb287676bbf9a96a70fe2d4
2020-03-03 07:44:20 -08:00
a23e8099dd Fix typo (#34008)
Summary:
This PR removes apparently unnecessary dots in the documentation of `torch.t`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34008

Differential Revision: D20195084

Pulled By: ezyang

fbshipit-source-id: a34022de6b7a32d05a0bb3da197ee3507f4b8d8e
2020-03-03 07:38:40 -08:00
2ce9d26809 Support cdf for mixture_same_family distribution (#33408)
Summary:
The newly added mixture_same_family distribution should support cdf if the component family has cdf implemented.

This is very useful for flow models, where the cdf of a mixture of Gaussians/logistics is used to model the flow.
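
A minimal sketch (my example; shapes are illustrative) of evaluating the cdf of a mixture of Gaussians with this change:

```
import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal

mix = Categorical(probs=torch.ones(5))             # 5 equally weighted components
comp = Normal(torch.randn(5), torch.rand(5) + 0.1) # component means and scales
gmm = MixtureSameFamily(mix, comp)
print(gmm.cdf(torch.tensor(0.0)))                  # weighted sum of component cdfs
```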
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33408

Differential Revision: D20191552

Pulled By: ezyang

fbshipit-source-id: 0bfd7973aa335c162919398a12ddec7425712297
2020-03-03 07:31:24 -08:00
e0b90b87a4 [C2] Fix slowness of the ReshapeOp. (#33729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33729

ReshapeOp is doing some useless movements of data between CPU and GPU, which results in a crazy number of kernel calls from this operator and makes it ridiculously slow compared to BatchMatMul for pretty cheap models (for example, some versions of GAT).

This diff moves ReshapeOp to leverage CPU storage, reducing the number of kernel calls from num_dims + 3 (for a 3-D tensor) to 2.

Test Plan:
Unit-tests are still passing.

TODO: perf testing

Reviewed By: akyrola

Differential Revision: D19659491

fbshipit-source-id: 2341b21e57208b988169f2df5fb598be3dc8acb2
2020-03-03 00:44:22 -08:00
0afee0c20b [rpc][metrics] add initial metric handler classes. (#33153)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33153

Test Plan: Added unit tests.

Reviewed By: pritamdamania87

Differential Revision: D19615364

fbshipit-source-id: e0447463651390b08ad48e134cb73764d8dcf4f3
2020-03-02 22:03:12 -08:00
0689cf8fc1 [c10] Make __assert_fail CUDA definition compilable with clang host compiler (#34102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34102

If nvcc is invoked with clang as the host compiler, it will fail with the following error due to a mismatch between the attribute declarations in CUDA and c10:
```
 error: attribute "noreturn" did not appear on original declaration
```

Test Plan: Build pytorch with clang

Reviewed By: EscapeZero

Differential Revision: D20204951

fbshipit-source-id: ff7cef0db43436e50590cb4bbf1ae7302c1440fa
2020-03-02 20:11:49 -08:00
cyy
8a14b41617 fix warnings reported by PVS (#33868)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33868

Differential Revision: D20169059

Pulled By: ailzhang

fbshipit-source-id: ec12226ae27ddd89fa5bacdd35151981ebfedcfd
2020-03-02 18:51:38 -08:00
0729ad733d Change lint from python2 -> python3 (#34107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34107

Updates linter to only lint for python3 instead of linting for python2

Test Plan: good_testplan

Reviewed By: orionr

Differential Revision: D20205395

fbshipit-source-id: 1fa34e5fdf15f7aed96a66d2ce824a7337ee6218
2020-03-02 18:11:42 -08:00
f909b5535e [autograd] fix allow_unused checking for C++ API (#34035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34035

Bug in the condition check in https://github.com/pytorch/pytorch/pull/24342; realized we don't have tests in either
python or cpp to catch this, so added tests for both python and cpp.

Thanks hczhu for catching it!

Test Plan: Imported from OSS

Differential Revision: D20198837

Pulled By: wanchaol

fbshipit-source-id: 33846a14c0a8e7aac2e8328189d10c38a0d7e6ee
2020-03-02 17:57:15 -08:00
0759191f12 blacklist spatialBN until bitwise matching (#34092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34092

Disable op in transform map until we get bitwise matching to ice-ref

Test Plan: CI

Reviewed By: hyuen

Differential Revision: D20177936

fbshipit-source-id: e316384184cb264852e63e5edce721a8614742d1
2020-03-02 17:55:00 -08:00
3b93928ada .circleci: Add filter to run nightly builds on tag (#34078)
Summary:
## What this will do:

When the repository is tagged the current nightly build pipelines will run and upload to the `test` subdirectory in our S3 bucket for `download.pytorch.org`. Will also upload to the correct organization on anaconda [pytorch-nightly](https://anaconda.org/pytorch-test)

This is only meant for release candidates and will actually not run on any tag that does not match the release candidate regex.

This has been tested on a small scale with: 3ebe0ff2f8

## Related PRs:
* `.circleci: Divert packages to test channel on tag`: https://github.com/pytorch/pytorch/pull/33842
* `.cirlceci: Swap PYTORCH_BUILD_VERSION if on tag`: https://github.com/pytorch/pytorch/pull/33326

## Work to be done later:
- [ ] Figure out how to remove manual step of updating s3 html indices.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34078

Differential Revision: D20204104

Pulled By: seemethere

fbshipit-source-id: 685630e8a04b19fc17374585e9228a13a8c3e407
2020-03-02 17:20:21 -08:00
ad3f4a32bd [pytorch][buck] fix selective buck build (#34090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34090

Update the per-op-registration template file to use the new c10 registration API.
ghstack-source-id: 99318973

Test Plan:
```
buck build -c pt.selective_build=1 \
fbandroid/mode/dev_clang_libcxx fbandroid/mode/server \
xplat/caffe2/fb/lite_predictor:lite_predictor_resnet
```

Differential Revision: D20200452

fbshipit-source-id: dc619cb6bdfc0c787b87475eb24b6a2da29e70e2
2020-03-02 17:13:08 -08:00
1ed950e1b6 [distributed] skip use_ignore_output tests in c10d if not built with gloo (#33513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33513

These tests require gloo, so like the other tests they should be
skipped if not building with gloo; otherwise they currently crash on Mac.

verified that it does not crash anymore with this PR.
ghstack-source-id: 99303707

Test Plan: Built on Mac and verified that the tests do not fail.

Differential Revision: D19976908

fbshipit-source-id: 6a2a70c3eab83efd0e188e86cabe56de4a869f4d
2020-03-02 16:43:21 -08:00
ff1fc402a8 Migrate dirichlet from CUDA_tensor_apply3 to TensorIterator (#34021)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34021

Test Plan: Imported from OSS

Differential Revision: D20196082

Pulled By: VitalyFedyunin

fbshipit-source-id: 9736a0ebbc529975e95a4f996dbc28e070cf1e63
2020-03-02 16:31:32 -08:00
77b9016a8e Migrate gamma grad from CUDA_tensor_apply3 to TensorIterator (#34020)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34020

Test Plan: Imported from OSS

Differential Revision: D20196083

Pulled By: VitalyFedyunin

fbshipit-source-id: 8659bc004678a656071263c94e929f2e1a686812
2020-03-02 16:29:45 -08:00
bb4465f9f5 .circleci: Add CUDA 10.2 to our CI pipeline (#33471)
Summary:
Adds support for CUDA 10.2 builds on our nightly pipelines / regular test pipeliens.

Depends on https://github.com/pytorch/builder/pull/404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33471

Test Plan: sandcastle_will_deliver

Reviewed By: ezyang

Differential Revision: D20169501

Pulled By: seemethere

fbshipit-source-id: 43b7ca680200a67fa88ad4f7b5a121954c9f089d
2020-03-02 15:50:48 -08:00
b874c039f6 Allow checking for cached module before asserting (#33954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33954

fixes caffe2/core/module_test.cc on windows
misc lint fixes.

Test Plan: CI green

Reviewed By: malfet

Differential Revision: D20153512

fbshipit-source-id: aeae84a028e26edd65c7218611e3c49a8d9bb8c0
2020-03-02 15:43:50 -08:00
a4716d0e26 Fix lint (#34094)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34094

Pulled By: driazati

Differential Revision: D20201433

fbshipit-source-id: d8292b329aebd232556db517b71daeee3f266bfc
2020-03-02 15:34:52 -08:00
c206b4398d Show errors from the tasks in the thread pool (#33938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33938

Making sure we don't silently ignore exceptions from the tasks in the
thread pool

Test Plan: python setup.py clean && python setup.py develop install

Differential Revision: D20178603

Pulled By: ilia-cher

fbshipit-source-id: 34971032205a1a53fb7419ed84ebb986f9e959ad
2020-03-02 14:49:52 -08:00
a57a7b4c29 Change input value in examples of BCEWithLogitsLoss (#34053)
Summary:
In the examples of `BCEWithLogitsLoss`, `0.999` is passed as the prediction value. The value `0.999` looks like a probability, but the input to this loss is actually a logit (a raw score). I think it's better to pass a value greater than 1, so as not to confuse readers.
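
A minimal sketch (my example, not taken from the PR) of why a raw score greater than 1 is clearer:

```
import torch

loss = torch.nn.BCEWithLogitsLoss()
out = loss(torch.tensor([1.5]), torch.tensor([1.0]))  # input is a logit, target a label
print(out)  # same as BCE computed on sigmoid(1.5), roughly 0.2014
```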
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34053

Differential Revision: D20195456

Pulled By: ezyang

fbshipit-source-id: 3abbda6232ee1ab141d202d0ce1177526ad59c53
2020-03-02 14:35:56 -08:00
15bf4892f2 prevent crash on exit from static destructor race (#33955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33955

unit tests on windows (clang and cl) were crashing on exit due to racing with static variable destruction.

Test Plan: CI green

Differential Revision: D20153587

fbshipit-source-id: 22e35e591660d49f3a755f93d0c14d7a023ebb2a
2020-03-02 14:28:13 -08:00
e568c039bd Enable Tensor.random_(from, to) for half on CPU (#34030)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34030

Test Plan: Imported from OSS

Differential Revision: D20182412

Pulled By: pbelevich

fbshipit-source-id: b7439e6d66e1c0b9ffa8b397cab057c9146f5714
2020-03-02 14:22:35 -08:00
384a4feab6 Fix bad math typesetting (#34027)
Summary:
Fixing documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34027

Differential Revision: D20195235

Pulled By: ezyang

fbshipit-source-id: 0281bc0e8718e700e0982ced1342969b367ba57c
2020-03-02 14:16:34 -08:00
11843049d5 [jit] Fix flipped PackedSequence outputs in script (#32955)
Summary:
Stacked PRs
 * **#32955 - [jit] Fix flipped PackedSequence outputs in script**
 * #32953 - [jit] Support properties on `Device`

Fixes #32605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32955

Pulled By: driazati

Differential Revision: D20165514

fbshipit-source-id: a130c438b40e51ec27d36f021b0dc7869570aa6a
2020-03-02 13:50:36 -08:00
45c45195cd Remove warning about building from source to use the NCCL backend (#34051)
Summary:
I think this warning isn't true anymore, and the NCCL backend works without PyTorch needing to be built from source.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34051

Differential Revision: D20195310

Pulled By: ezyang

fbshipit-source-id: 14f879a8c43ea5efdbdf0f638792ea2b90011f4a
2020-03-02 13:43:43 -08:00
51d969e86a preprocessor cleanup (#33957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33957

lots of small preprocessor warning cleanup for windows

Test Plan: CI green

Reviewed By: malfet, albanD

Differential Revision: D20153582

fbshipit-source-id: 18fd61c466fd1f55ededdae4448b3009a9cedc04
2020-03-02 13:37:19 -08:00
4b3ae7e0af Enable -Werror=format compile errors on torch exception types (#34019)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33899

In the issue, we have
```
TypeError("expected %s (got %s)", dispatch_key, toString(other.key_set()).c_str());
```
which results in `dispatch_key` being interpreted as a c-string by `sprintf`. Adding `__attrbute__((format))` to the `TypeError` constructor allows gcc or clang to detect this at compile time. Then `-Werror=format` makes it a hard error at compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34019

Differential Revision: D20194842

Pulled By: ezyang

fbshipit-source-id: fa4448916c309d91e3d949fa65bb3aa7cca5c6a8
2020-03-02 13:25:39 -08:00
9239608037 fix windows clang attributes (#33959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33959

make sure clang on windows uses correct attributes.
add support for cl.exe style pragma attributes

Test Plan: CI green

Differential Revision: D20153548

fbshipit-source-id: bfbfd374e8f5e7d7b8598453c3ca2b6693a425f1
2020-03-02 13:20:51 -08:00
87b3f87f27 Migrate prelu from CUDA_tensor_apply2 to TensorIterator (#34003)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34003

Test Plan: Imported from OSS

Differential Revision: D20196994

Pulled By: VitalyFedyunin

fbshipit-source-id: 1749a968b1ec6636e08c11c93de43b5599e7cf4b
2020-03-02 12:49:32 -08:00
9956a231b9 Fix backward compatibility tests (#34071)
Summary:
1. As RRef has been added as a JIT type in https://github.com/pytorch/pytorch/issues/32992, we no longer need to skip them
2. Nightly now knows about Any
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34071

Reviewed By: houseroad

Differential Revision: D20196963

Pulled By: mrshenli

fbshipit-source-id: 1ea79c5682e8be9087b9cb74104e1b914c3fc456
2020-03-02 12:42:33 -08:00
ec0f2184ba clang intrinsics targeting (#33958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33958

look for clang intrinsic headers on windows

Test Plan: CI green

Differential Revision: D20153573

fbshipit-source-id: c87da3b0e9950d3df0bf8350df8ae592064d6613
2020-03-02 12:37:07 -08:00
ba4cff2ffc [dtype inference] Following pytorch default for float vs double (#33713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33713

Differential Revision: D20193387

Pulled By: anjali411

fbshipit-source-id: d802ec395df4e75e2be02e91d7288ae6fb7cf8e0
2020-03-02 11:56:34 -08:00
cab8772c6c Freezing Torchscript modules (#32178)
Summary:
This patch enables folding GetAttr nodes with their corresponding
values. The _jit_pass_freeze_module API returns a new TorchScript module
where all function calls and get attributes are inlined.
Usage:

frozen_model = torch._C._freeze_module(scripted_model._c)
frozen_model.forward(...)
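
A slightly fuller sketch of the usage above (my example; the module is arbitrary):

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(2, 2))

    def forward(self, x):
        return x + self.weight

scripted_model = torch.jit.script(M()).eval()
frozen_model = torch._C._freeze_module(scripted_model._c)
# self.weight is folded into the graph as a constant.
print(frozen_model.forward(torch.zeros(2, 2)))
```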

This API currently optimizes the forward method. We will follow up
to preserve and optimize methods and attributes that are annotated as
torch.jit.interface.

Several future improvements to JIT optimizations are required to further
clean up/de-sugar the graph and eliminate redundancies.
Ideally, we want to produce a graph that can easily be lowered to
GLOW and other low-level backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32178

Differential Revision: D19419640

Pulled By: bzinodev

fbshipit-source-id: 52baffaba9bca2cd60a8e747baa68d57711ad42b
2020-03-02 11:38:36 -08:00
e73d4286b0 Fix conflict between XNNPACK's clog dependency and our cpuinfo dependency (#33922)
Summary:
Currently if we run

```bash
DEBUG=1 ONNX_ML=0 MAX_JOBS=8 CMAKE_CXX_COMPILER_LAUNCHER=ccache CMAKE_C_COMPILER_LAUNCHER=ccache CMAKE_CUDA_COMPILER_LAUNCHER=ccache USE_OPENMP=0 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_NCCL=0 USE_CUDA=1 USE_CUDNN=0 USE_STATIC_CUDNN=0 USE_NNPACK=0 USE_QNNPACK=0 USE_FBGEMM=0 BUILD_TEST=0 TORCH_CUDA_ARCH_LIST="6.1" python setup.py develop --cmake-only
```

then `touch build/CMakeCache.txt` (which adjusting build options will
do), then `python setup.py develop`, the following error message will
show up:

```
CMake Error at build/clog-source/CMakeLists.txt:249 (ADD_SUBDIRECTORY):
ADD_SUBDIRECTORY not given a binary directory but the given source
directory "/home/hong/wsrc/pytorch/build/clog-source" is not a subdirectory
of "/home/hong/wsrc/pytorch/build/clog-source".  When specifying an
out-of-tree source a binary directory must be explicitly specified.
```

This is due to a conflict between our cpuinfo submodule and XNNPACK's
external clog dependency. Moving our cpuinfo upward and setting
CLOG_SOURCE_DIR resolves the issue.

 ---

Also reverted https://github.com/pytorch/pytorch/issues/33947, where `CLOG_SOURCE_DIR` as an option is not quite appropriate (given that cpuinfo uses its included clog subdir) and the setting of this variable should happen a bit later, when the dir of cpuinfo is known.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33922

Differential Revision: D20193572

Pulled By: ezyang

fbshipit-source-id: 7cdbdc947a6c7e0ef10df33feccb5b20e1b3ba43
2020-03-02 10:40:12 -08:00
Jie
e54b8e1a47 [CUDNN NHWC CONVOLUTION] Re-stride input tensors for wgrad in cudnn_convolution (#33784)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33784

Differential Revision: D20127485

Pulled By: VitalyFedyunin

fbshipit-source-id: 9d893ffe7ff9499e7e9a7e8bed720e9441d1018e
2020-03-02 10:05:59 -08:00
31737e989d [aten] remove shadowed declaration warning (#34014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34014

Remove warning
```
caffe2/aten/src/ATen/core/op_registration/op_registration.h: In lambda function:
caffe2/aten/src/ATen/core/op_registration/op_registration.h:704:47: warning: declaration of ‘c10::DeviceType t’ shadows a parameter [-Wshadow=compatible-local]
   auto deviceTypeToDispatchKey = [](DeviceType t){
                                               ^
caffe2/aten/src/ATen/core/op_registration/op_registration.h:703:21: note: shadowed declaration is here
 inline CppFunction dispatch(DeviceType t, Func&& raw_f) {
          ~~~~~~~~~~~^
```

Test Plan: CI

Reviewed By: dzhulgakov

Differential Revision: D20181155

fbshipit-source-id: 41947d171369b9bd7a87e3e367492f9b2165fd6b
2020-03-02 09:22:13 -08:00
ad17dafc50 [caffe2] Remove python2 from operator_test (#33977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33977

Removing python2 from operator_test so we can retire python2 support for PyTorch.

Test Plan: waitforsandcastle

Reviewed By: seemethere

Differential Revision: D20129500

fbshipit-source-id: d4c82e4acfc795be9bec6a162c713e37ffb9f5ff
2020-03-02 08:55:53 -08:00
f4532d7542 Fix typo (#33925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33925

Differential Revision: D20171970

Pulled By: vincentqb

fbshipit-source-id: 5c1a8553760f74cecebaea7e88463b767ab81211
2020-03-02 08:13:55 -08:00
71f8624ecb Revert D19153199: [ATen] Remove AT_ASSERTM from Blob::free_()
Test Plan: revert-hammer

Differential Revision:
D19153199

Original commit changeset: f93983d5bf32

fbshipit-source-id: d79cf659f3cb26427196b9d9d1fe44e15874ad79
2020-03-02 07:35:35 -08:00
6631c2a627 [doc] Add grad context manager doc to toplevel torch module. (#33877)
Summary:
fixes https://github.com/pytorch/pytorch/issues/32014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33877

Differential Revision: D20141801

Pulled By: albanD

fbshipit-source-id: bac713382a71666dd5e2499f710c51a55cc579ba
2020-03-02 06:32:36 -08:00
a500491cbc Fix index_put when tensor length > int_max (#33753)
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/33345.

The original CUDA kernel looks good. I changed most appearances of `int` to `int64_t` to avoid the CUDA memory access issue. Removed the two `TORCH_CHECK`. Added a unit test.

cc csarofeen ngimel ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33753

Differential Revision: D20185005

Pulled By: ngimel

fbshipit-source-id: ef0abdc12ea680e10fe6b85266e2773c7a272f0d
2020-03-01 21:51:23 -08:00
f857fe18cd [ATen] Remove AT_ASSERTM from Blob::free_() (#33929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33929

`Blob::~Blob()` calls `Blob::free_()`. `Blob::free_()` throws and destructors should not throw.

A few other minor tweaks include:
- Remove `static_cast<void*>()` in `ShareExternal`
- Remove default values of `pointer_` and `has_ownership_`

Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu
```

https://our.intern.facebook.com/intern/ads/canary/424941782651397826
https://our.intern.facebook.com/intern/ads/canary/424941799628450155

Reviewed By: yinghai

Differential Revision: D19153199

fbshipit-source-id: f93983d5bf324b9a464ad2d1ed0dba13f807d2f6
2020-03-01 21:09:04 -08:00
e017b1e9fb Updating submodules
Summary:
GitHub commits:

af57f36db0

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 4bd71218aee5e2a20a3496f2a51d464a19c0f879
2020-03-01 20:54:32 -08:00
ad769d74d9 Collapse _like overloads into a single overload. (#33705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33705

The fact that there were two overloads appears to be a historical
artifact that dates back to when goldsborough originally added these
bindings in the first place.  If TensorOptions is made optional,
then you only need one overload, not two, as they are exactly redundant
with each other.  When MemoryFormat was added, it was made a little
harder to do this, as the C++ syntax at::empty_like(t, memory_format) would
not work if you collapsed the overload; but now it works because TensorOptions
supports MemoryFormat.

The upshot is, I can get rid of all the overloads and just have one overload.
Amazingly, this change is backwards compatible, as the test attests.  While
I was at it, I also deleted the overload name from the functions entirely.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20073355

Pulled By: bhosmer

fbshipit-source-id: c6a8908213b32ccf6737ea864d135e2cce34f56b
2020-03-01 19:40:22 -08:00
b98bce8cd4 Add MemoryFormat to TensorOptions, but not codegen. (#33704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33704

This diff adds MemoryFormat field to TensorOptions, and teaches
all kernels that take TensorOptions to respect it, but doesn't
teach the codegen about it.  As such, it is now possible to specify
memory_format using TensorOptions syntax, e.g.,
at::empty_like(tensor, at::memory_format(MemoryFormat::Contiguous))
in the C++ API, but there isn't any other user visible effect.

The intended end state of this diff stack is to eliminate the
explicit MemoryFormat? arguments from native functions, but
as this change has BC implications I'd prefer to do it separately.
So this starts things off with a non-BC breaking addition to the
API.  For all internal functions that are not bound by codegen,
I switch them to exclusively using TensorOptions (eliminating
MemoryFormat); there's only a few, mostly quantized and to().

To keep things screwed down in the short term, it is a HARD ERROR
to specify both the explicit MemoryFormat argument as well as
TensorOptions.  This caught a few errors in my diff where I needed
to modify memory format settings and then call code later, esp
in empty_like.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20073356

Pulled By: bhosmer

fbshipit-source-id: 18d310d7ee7cf2ee182994104652afcfc9d613e2
2020-03-01 18:22:12 -08:00
9f7708eecb Updating submodules
Summary:
GitHub commits:

8c1badaa4a
ce1ee42199
b23caba073
aa48f50c9a
f7695cddae
8a386d9549
baab5386e2

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 6c036499de97418afd9337979e89365ce13ceee7
2020-03-01 16:05:00 -08:00
15caf3b516 move test helper functions out of test funciton (#33960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33960

test helper functions should be out of test function. it is possible process 2 launches test functions slower than process 1, and process 1 sends request to run a helper function on process 2. process 2 may have not compile the helper function yet when process 2 starts to serve processs 1's request, and thus may return error like "attempted to get undefined function"
ghstack-source-id: 99205620

Test Plan: test_remote_script_module was flaky for the thrift backend in my local stress test runs, due to the error "attempted to get undefined function". With the fix in this diff, stress runs passed.

Differential Revision: D20167969

fbshipit-source-id: 8a2b9cd7bd62462e24bdbcb69ad32dca745d6956
2020-03-01 14:16:56 -08:00
84ec5357d3 Make HashNode visible (#34045)
Summary:
HashNode and CompareNode are useful functions for handling jit::Node. This is to unblock https://github.com/pytorch/glow/pull/4235.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34045

Reviewed By: houseroad

Differential Revision: D20184733

Pulled By: yinghai

fbshipit-source-id: 6c829f2f111a490fd2d85017475c1731cd97fb20
2020-03-01 12:28:18 -08:00
ace2b4f37f [resubmit] try to infer rref type from python (#33992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33992

resubmit of https://github.com/pytorch/pytorch/pull/33369 with tweaks on when the RRef type is created, to ensure ivalue->type() holds the correct RRef type for the inner element type.

Test Plan: Imported from OSS

Differential Revision: D20175043

Pulled By: wanchaol

fbshipit-source-id: a08b178e989c995632374e6c868d23c5a85526ae
2020-02-29 20:26:40 -08:00
7747fe81c4 reuse named tensor error message in generated code (#33536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33536

Simple fix: merge the identical string literals that were being inlined into every wrapper for ops that don't support named tensors. E.g.
```
Tensor all(const Tensor & self, int64_t dim, bool keepdim) {
    if (self.has_names()) {
        AT_ERROR(
            "all is not yet supported with named tensors. Please drop names via "
            "`tensor = tensor.rename(None)`, call the op with an unnamed tensor, "
            "and set names on the result of the operation.");
    }
    const OptionalDeviceGuard device_guard(device_of(self));
    return at::native::all(self, dim, keepdim);
}
```
becomes
```
Tensor all(const Tensor & self, int64_t dim, bool keepdim) {
    if (self.has_names()) {
        AT_ERROR("all", named_tensors_unsupported_error);
    }
    const OptionalDeviceGuard device_guard(device_of(self));
    return at::native::all(self, dim, keepdim);
}
```

Also updated the generated file comments to include the source template names, e.g.
```
// generated by aten/src/ATen/gen.py from TypeDefault.cpp
```

Test Plan: Imported from OSS

Differential Revision: D19993407

Pulled By: bhosmer

fbshipit-source-id: 88395a649e6ba53191332344123555c217c5eb40
2020-02-29 17:00:13 -08:00
7f7ea685c0 Revert D18672405: Use codegen'ed unboxing wrappers
Test Plan: revert-hammer

Differential Revision:
D18672405

Original commit changeset: bf2a7056082d

fbshipit-source-id: b7ef1529fc266b4856e49e4dbd1fe8c7ba3d455d
2020-02-29 15:27:54 -08:00
3acfccafbb Revert D20172782: Fix mobile build
Test Plan: revert-hammer

Differential Revision:
D20172782

Original commit changeset: e4bfca2a6076

fbshipit-source-id: 3093efd4a135f8d6c3174887ad1e3362aad9aa7c
2020-02-29 15:21:07 -08:00
595445e889 Revert D20178827: Fix mobile build
Test Plan: revert-hammer

Differential Revision:
D20178827

Original commit changeset: 980ac3d1ab3d

fbshipit-source-id: 9af6cb319e80c9b6a916bbdeffd69920075c7aec
2020-02-29 15:04:35 -08:00
c596ec7eb3 [pytorch] update code analyzer script to cover new c10::Module::def API (#33975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33975

Currently the code analysis script doesn't go beyond the scope of the
registration API call, i.e. calling registration via a wrapper will not
be covered by the analysis - currently the new API is essentially a
wrapper around the old API.

Simply adding the new API signature to the registration API pattern can
solve the problem for now. We might need to change the analyzer code if
things change significantly in the future.

Test Plan:
- update test project to use the new API;
- run analyzer against pytorch codebase;

Differential Revision: D20169549

Pulled By: ljk53

fbshipit-source-id: c7925fa0486eee18f07e791a38c32152fee59004
2020-02-29 10:29:45 -08:00
5a8562a6af Fix mobile build (#34000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34000

-
ghstack-source-id: 99241400

Test Plan: liujiakai

Differential Revision: D20178827

fbshipit-source-id: 980ac3d1ab3d47c12613c20ee9b8dc7d083f56a9
2020-02-28 23:28:00 -08:00
1494005cfd C++ tensor indexing: more indexing tests (#30427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30427

Test Plan: Imported from OSS

Differential Revision: D18695899

Pulled By: yf225

fbshipit-source-id: 74455fe52ef922556fabe65aefca9ec93fe2346d
2020-02-28 22:07:41 -08:00
0e52627358 Fixing pthreadpool symbol conflict issue. (#33869)
Summary:
Mainly renames C2's pthread_create, the only conflicting symbol referenced
internally by NNPACK, to pthread_create_c2.
Removed 2 other conflicting symbols that are not used internally at all.
Pointing XNNPACK to original repo instead of the fork.

Copy-pasted the new interface and implementation to
caffe2/utils/threadpool, so that internal builds compile against
this.

When threadpool is unified this will be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33869

Differential Revision: D20140580

Pulled By: kimishpatel

fbshipit-source-id: de70df0af9c7d6bc065e85ede0e1c4dd6a9e6be3
2020-02-28 21:23:18 -08:00
85b1c45a45 [JIT] fix alias assertion (#33952)
Summary:
This bug has been hit a couple times recently. We need to handle all bivariant types, not just optional, when asserting mutability/immutability of pointed-to elements in alias analysis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33952

Differential Revision: D20166025

Pulled By: eellison

fbshipit-source-id: cf3df9897a639641ef8303a08ba2b13523d01ef1
2020-02-28 19:54:29 -08:00
2111c4ff0c [jit] Add missing tensor properties (#33906)
Summary:
Fixes #30775

This adds TorchScript implementations (copied from `python_variable.cpp`) for the remaining `Tensor` properties that were missing from the jit, in addition to a test that ensures new properties will trigger a failure so we can decide whether we want to add them as well.

For `some_tensor`, adds:

* `some_tensor.T`
* `some_tensor.ndim`
* `some_tensor.is_leaf`
* `some_tensor.name`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33906

Pulled By: driazati

Differential Revision: D20153288

fbshipit-source-id: 2ddc48a14267077bc176065267e5ce52181b3d6b
2020-02-28 19:06:11 -08:00
6e70b2da62 Fix mobile build (#33985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33985

This was broken by https://github.com/pytorch/pytorch/pull/32521 but only showed up in master CI builds
ghstack-source-id: 99220995

Test Plan: CI

Differential Revision: D20172782

fbshipit-source-id: e4bfca2a6076f1bc1c562fca9c7dfcb156bfbf3e
2020-02-28 18:43:18 -08:00
2f6ffe8c39 [jit] Resolve type annotation names to types (#29623)
Summary:
This adds some machinery so that we use Python to resolve types to a value and the corresponding resolution logic in `annotations.py` instead of using the string.

This PR also `slowTests` a random test since it was taking > 1 min whereas all the other tests take < 10 seconds.

Fixes #31864
Fixes #31950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29623

Pulled By: driazati

Differential Revision: D20144407

fbshipit-source-id: ef3699f6b86039d8b4646ffc42c21bd1132d1681
2020-02-28 18:35:10 -08:00
55b44f6746 Throw an exception when method cannot be found from mobile module. (#33972)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33972

Test Plan: Imported from OSS

Differential Revision: D20168965

Pulled By: iseeyuan

fbshipit-source-id: 2efe5dcb1fb80407cd88a47c50cb382ecd8aa275
2020-02-28 18:28:09 -08:00
de55e47a4b Pass all ops to XLA with additional info about whether it's compound (#33908)
Summary:
This PR prepares us to allow XLA to use `XLAPreAutograd` to override compound ops.
To do this we'll need to pass all ops, with additional information about whether each is compound or not, for XLA to parse.
Companion PR: https://github.com/pytorch/xla/pull/1698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33908

Differential Revision: D20149585

Pulled By: ailzhang

fbshipit-source-id: a93140e8a34548fcabcea454386d15df58177c1d
2020-02-28 18:17:23 -08:00
38b6cb479b Check fuser results when profiling (#33944)
Summary:
With the profiling executor enabled, the fuser won't be invoked until the second pass over a script function, so some of these tests weren't correctly comparing the fused output with the interpreter output. I've used the `checkScript` method where applicable, which seems to do the right thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33944

Test Plan: Locally inject obvious errors into the fuser and verify that the updated tests fail when they're supposed to.

Differential Revision: D20162320

Pulled By: bertmaher

fbshipit-source-id: 4a2f3f2d2ff1d81f23db504dc8cd0d5417bdcc50
2020-02-28 17:01:34 -08:00
4377061baf [caffe2] fix atomicAdd redeclaration Clang error (#33559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33559

For sm_60+, CUDA supports the `atomicAdd(double*, double)` function, and for lower compute capabilities the CUDA C Programming Guide [1] suggests a user implementation, as in this code. On the other hand, Clang's CUDA wrappers unconditionally define this function, regardless of compute capability, and emit an error if it actually gets used.

So the problem is: when Clang is used for < sm_60, CUDA's `atomicAdd(double*, double)` cannot be used, and it cannot be redeclared in standard-compliant C++.

Work around the problem by using Clang's `enable_if` attribute [2], which has the side effect of permitting the redeclaration; a sketch follows the references below.

1. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
2. https://clang.llvm.org/docs/AttributeReference.html#enable-if
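
A sketch of the workaround, combining the guide's CAS fallback [1] with the `enable_if` redeclaration [2]; the exact guard condition is illustrative:

```cpp
#if defined(__clang__) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
// Redeclare atomicAdd(double*, double) even though Clang's CUDA wrapper
// already declares it; the no-op enable_if makes this a distinct, legal
// redeclaration that overrides the header version.
static inline __device__ double atomicAdd(double* address, double val)
    __attribute__((enable_if(true, ""))) {
  // CAS loop suggested by the CUDA C Programming Guide for pre-sm_60 devices.
  unsigned long long int* address_as_ull =
      reinterpret_cast<unsigned long long int*>(address);
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);
  return __longlong_as_double(old);
}
#endif
```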

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```

Reviewed By: ngimel

Differential Revision: D20005113

fbshipit-source-id: d0d4bd6514f201af9cdeba1229bd9b798df0d02e
2020-02-28 15:48:19 -08:00
4fb8679218 [caffe2] fix field initialization after base Clang errors (#33556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33556

Fix several places exposed by Clang where order of member initializer list doesn't actually match the actual initialization order. The fix is to simply reorder member initializer lists.

Also accepted formatting changes suggested by clang-format linter.

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```

Reviewed By: ngimel

Differential Revision: D20004834

fbshipit-source-id: b61c7c3f1fe8413bbb3512f6b62177a3ddf67682
2020-02-28 15:42:49 -08:00
991f7a20f2 Use clog from cpuinfo/deps instead of downloading (#33947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33947

XNNPACK was downloading clog because we weren't setting CLOG_SOURCE_DIR.
Actually, it was downloading cpuinfo and pointing to the copy of clog
within that.  So let's just point to the copy of clog within the cpuinfo
submodule we already have.

(Note: this ignores all push blocking failures!)

Test Plan:
Ran cmake and didn't see any downloading.
Verified that our clog is the same as the one that was being downloaded
with `diff -Naur`.

Differential Revision: D20169656

Pulled By: suo

fbshipit-source-id: ba0f7d1535f702e504fbc4f0102e567f860db94b
2020-02-28 15:19:03 -08:00
69d2741480 Add list of view ops to public doc. (#32560)
Summary:
This PR comes from a discussion with albanD in https://fb.quip.com/npBHAXaPfnbu. The main goal is to distinguish view ops from general out-of-place/in-place ops and remind users of the difference.
For reference, this information was previously only available in code, which is internal and hard to find. Also, changes to this list actually affect users, so we think it's better to expose it as public information. It's also helpful for new backends like XLA when implementing PyTorch ops. 19bbb4fccb/tools/autograd/gen_autograd.py (L32-L68)
Please feel free to comment!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32560

Differential Revision: D20161069

Pulled By: ailzhang

fbshipit-source-id: b5f1fd4353fe7594a427784db288aeb5a37dc521
2020-02-28 15:05:55 -08:00
b678256bfb Move glu to Aten(CPU) (#33179)
Summary:
This PR moves glu to ATen (CPU).
Test script:
```
import torch
import torch.nn.functional as F
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"

#warm up
for n in [10, 100, 1000, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n // 2, device=device)
    for i in range(1000):
        output = F.glu(input)
        output.backward(grad_output)

for n in [10, 100, 1000, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n // 2, device=device)
    for i in range(10000):
        t1 = _time()
        output = F.glu(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test device: **skx-8180.**
Before:
```
input size(128, 10) forward time is 0.04 (ms); backward avg time is 0.08 (ms).
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.14 (ms).
input size(128, 1000) forward time is 0.11 (ms); backward avg time is 0.31 (ms).
input size(128, 10000) forward time is 1.52 (ms); backward avg time is 2.04 (ms).
```
After:
```
input size(128, 10) forward time is 0.02 (ms); backward avg time is 0.05 (ms).
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.09 (ms).
input size(128, 1000) forward time is 0.07 (ms); backward avg time is 0.17 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 1.03 (ms).
```
Fix https://github.com/pytorch/pytorch/issues/24707, https://github.com/pytorch/pytorch/issues/24708.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33179

Differential Revision: D19839835

Pulled By: VitalyFedyunin

fbshipit-source-id: e4d3438556a1068da2c4a7e573d6bbf8d2a6e2b9
2020-02-28 14:54:38 -08:00
3c5677a676 Use codegen'ed unboxing wrappers (#32521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32521

Not all ops support the templated unboxing wrappers yet. For the ones that don't,
let's use the codegen'ed unboxing wrappers from register_aten_ops.cpp, but register
them with c10 directly instead of JIT.

The `use_c10_dispatcher` setting in `native_functions.yaml` now has a new option 'with_codegenerated_unboxing_wrapper', which means we take the codegened unboxing wrapper from register_aten_ops.cpp and stuff it into c10. This new option is the default; 'unboxed_only' is no longer the default. For the (very few) ops that don't support boxed dispatch yet (i.e. ops taking TensorOptions arguments), we set them to 'unboxed_only' and they follow the old behavior of having register_aten_ops.cpp register the jit op.

Next steps here are (1) to make TensorOptions work with boxed dispatch and remove the `unboxed_only` option from `use_c10_dispatcher`, so that all ops go through the new path and (2) make the new path template-only and remove codegen from it (see https://github.com/pytorch/pytorch/issues/32366).

First experiments show that
- For a small JITted model that calls add (i.e. a op with just two arguments that are both tensors) on two tensors in a loop, we see a 2-4% performance improvement (~35-50ns) when compared to the old path. This is a simple op that takes two tensor arguments and no non-tensor arguments, so iterating over it in boxed dispatch is cheap.
- For a small JITted model that calls avgpool1d (i.e. an op that has one tensor arg and 5 non-tensor args) on a tensor in a loop, we see a 3-4% performance regression (~60ns) when compared to the old path. This is an op that takes only one tensor argument and then 5 non-tensor arguments. Unboxed dispatch doesn't have to look at those, but boxed dispatch still needs to iterate over them.

This performance difference is likely due to boxed dispatch iterating over all arguments in a loop and unboxed dispatch not having to look at non-tensor arguments.
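
For intuition, a simplified sketch of what a boxed unboxing wrapper does; the per-IValue pops below are the per-argument loop cost described above (the wrapper name is illustrative):

```cpp
#include <ATen/ATen.h>
#include <ATen/core/stack.h>

// Boxed calling convention: arguments arrive as IValues on a stack. The
// wrapper pops each one, calls the unboxed kernel, and pushes the result.
// Ops with many scalar arguments pay for every pop, unlike unboxed dispatch.
void add_unboxing_wrapper(torch::jit::Stack& stack) {
  at::Tensor other = torch::jit::pop(stack).toTensor();
  at::Tensor self = torch::jit::pop(stack).toTensor();
  torch::jit::push(stack, at::add(self, other));
}
```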

ghstack-source-id: 99161484

Test Plan: unit tests that call existing ops through JIT

Differential Revision: D18672405

fbshipit-source-id: bf2a7056082dfad61e7e83e9eeff337060eb6944
2020-02-28 14:48:25 -08:00
2fa51dde28 Remove unnecessary tensor copies (#33732)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33732

Move and forward instead of copying.

Benchmarks:
A microbenchmark calling the add operation on two tensors in a tight loop shows a 5% improvement in performance.
No visible change for a model like resnet that does more work in its kernels.
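
A minimal sketch of the pattern (illustrative helpers, not the actual call sites): copying an IValue bumps an atomic refcount, while moving just transfers the payload.

```cpp
#include <utility>
#include <vector>

#include <ATen/core/ivalue.h>

// Sink the argument by value and move it into place; the pre-diff pattern
// copied `v`, paying a refcount increment per call.
void push_output(std::vector<c10::IValue>& outputs, c10::IValue v) {
  outputs.push_back(std::move(v));
}

// On templated call paths, perfect-forward instead of copying.
template <class T>
void push_forwarded(std::vector<c10::IValue>& outputs, T&& v) {
  outputs.emplace_back(std::forward<T>(v));
}
```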
ghstack-source-id: 99161486

Test Plan: benchmarks

Differential Revision: D20082642

fbshipit-source-id: eeac59686f8621dd5eaa85d61e6d219bba48c847
2020-02-28 14:47:04 -08:00
917e56e950 Throw an error if nbytes is called on a sparse tensor. (#33897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33897

Test Plan: Imported from OSS

Differential Revision: D20146388

Pulled By: gchanan

fbshipit-source-id: b5853096e290fa7fb50be41446b138ebdf71009f
2020-02-28 14:12:50 -08:00
f5d92fbc25 Get rid of newWithStorage2d calls. (#33823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33823

Test Plan: Imported from OSS

Differential Revision: D20122448

Pulled By: gchanan

fbshipit-source-id: b249372c93ee71b84a293dfb5c298a8fb664da16
2020-02-28 14:07:44 -08:00
56d9906083 update mapping of fake operators (#33946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33946

update mapping of fake operators to model nnpi
update SpatialBN to non-lowered

Test Plan:
compilation

https://github.com/pytorch/pytorch/pull/33946

Reviewed By: amylittleyang

Differential Revision: D20156136

fbshipit-source-id: e6ed87c3c5eba692a49376f0d9dae37ae185f185
2020-02-28 14:01:02 -08:00
ad44394f15 Updating submodules
Summary:
GitHub commits:

e5b1164ad7
6df461c14e
41535d0218
30c57a1a0e
3b9aeb2ebe

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 8361b5814c531edc99f96f11db97d6b2adcc5280
2020-02-28 13:29:48 -08:00
9fd1a7697f Create CODE_OF_CONDUCT.md 2020-02-28 13:20:00 -08:00
a726827ec8 Formatting changes for gradient scaling (#33832)
Summary:
These are hard to get right locally: I can build the docs, but the local output never quite matches what it looks like live. The bullet-point indentation was just an oversight.

Removing `Returns:` formatting tabs because they take up a lot of space when rendered and add no clarity. Some functions in PyTorch [do use them](https://pytorch.org/docs/master/torch.html#torch.eye), but [many don't bother](https://pytorch.org/docs/master/torch.html#torch.is_tensor), so apparently some people shared my feelings (not using them is in line with existing practice).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33832

Differential Revision: D20135581

Pulled By: ngimel

fbshipit-source-id: bc788a7e57b142f95c4fa5baf3fe01f94c45abd8
2020-02-28 11:40:48 -08:00
5dde8cd483 [caffe2] fix no matching function min/max Clang errors (#33563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33563

When NVCC or Clang is driving CUDA compilation, many math functions are declared by default, with a small difference: Clang marks them as `__device__` only, while NVCC uses both `__host__` and `__device__`. This makes every un-elaborated `min` or `max` call from a `__host__` function generate a syntax error when Clang is used.

Fix the errors by using `std::min` and `std::max` from `<algorithm>`; since C++14 they are `constexpr` and can be used in `__device__` code [1]. A sketch follows the reference below.

1. https://llvm.org/docs/CompileCudaWithLLVM.html#algorithm
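
A sketch of the fix pattern (`clamp01` is an illustrative function, not code from the diff):

```cpp
#include <algorithm>

// Under Clang, a bare min/max in __host__ __device__ code resolves to the
// __device__-only overloads and fails to compile for host. std::min and
// std::max are constexpr since C++14 and usable on both sides.
__host__ __device__ inline float clamp01(float x) {
  return std::min(std::max(x, 0.0f), 1.0f);
}
```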

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```

Reviewed By: ngimel

Differential Revision: D20005795

fbshipit-source-id: 98a3f35e8a96c15d3ad3d2066396591f5cca1696
2020-02-28 11:33:24 -08:00
c6d301220a Fix torch.cat() performance regression on single core CPU (#33534)
Summary:
This PR addresses the performance regression on `torch.cat()` on CPU with single thread.
Previous optimization https://github.com/pytorch/pytorch/issues/30806 introduced regression for several cases on pytorch operator benchmark.
See https://github.com/pytorch/pytorch/issues/33334 for detail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33534

Differential Revision: D20129963

Pulled By: VitalyFedyunin

fbshipit-source-id: 3fa6cd266978e5b54fa37105555502b77352df3e
2020-02-28 11:22:08 -08:00
890242254b Updating submodules
Summary:
GitHub commits:

6f4df6e0cd
6b7df86da1
f873713ad6
2b3b76cc4d
b990727d33

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: bf7b1639ee23e1e823bc2217f56c87dc7befaf7f
2020-02-28 10:42:20 -08:00
04dc0e6973 Split Distribution.cu into smaller files to reduce compilation time. (#33892)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33892

Test Plan: Imported from OSS

Differential Revision: D20148925

Pulled By: gchanan

fbshipit-source-id: 955e6ff22ee5fb24000b9f2ee58a243e76edf993
2020-02-28 09:21:51 -08:00
dece155335 Modified assertEqual to handle complex tensors (#33773)
Summary:
- Modified assertEqual to handle complex tensors
- Added a test in test_torch.py to test torch.zeros
- Added dispatch for complex for index_kernel, index_put_kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33773

Differential Revision: D20135553

Pulled By: anjali411

fbshipit-source-id: f716604535c0447ecffa335b0fc843431397c988
2020-02-28 08:43:28 -08:00
09046713cc removed .data from test_autograd.py (#33886)
Summary:
issue: https://github.com/pytorch/pytorch/issues/33630
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33886

Differential Revision: D20160292

Pulled By: anjali411

fbshipit-source-id: 14a42d8148bd60db2dd8ec39f83f99c061ae19c1
2020-02-28 08:24:07 -08:00
f5f1e5e7f6 [quant][graphmode][refactor] Factor out getInvokedMethod (#33649)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33649

Test Plan:
.

Imported from OSS

Differential Revision: D20123589

fbshipit-source-id: 0853d757434fb85c6d86666ff9fc991f8c4cb4bc
2020-02-27 23:48:09 -08:00
7f1112820a [quant][graphmode][refactor] Move check for weight outside of insertObserverFor (#33276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33276

att

Test Plan:
.

Imported from OSS

Differential Revision: D20123593

fbshipit-source-id: 45dc8488ddf02225ba2c20374c9385edd77a4912
2020-02-27 23:48:04 -08:00
7c13f576ea [quant][graphmode][refactor] Checks for bias and weight (#33273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33273

- Move the check for bias to valueNeedsToBeQuantized
- Move TORCH_CHECK inside the functions for checking if a value is bias or weight

Test Plan:
.

Imported from OSS

Differential Revision: D20123595

fbshipit-source-id: 4b805d57dcaf41a6436506d021dd5f6518bc88fd
2020-02-27 23:47:59 -08:00
97541a5106 [quant][graphmode][refactor] Move values_to_skip check inside valueNeedsToBeQuantized (#33275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33275

att

Test Plan:
.

Imported from OSS

Differential Revision: D20123592

fbshipit-source-id: 2b56ea8bab27eb9ea2bf792c83e48a7af8917e1a
2020-02-27 23:46:29 -08:00
64aab3260a [jit] allow RRef local creation with IValue objects (#33263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33263

This PR allows PyRRef local creation to inspect the pyobject: if it
finds that it can be turned into an IValue, it converts to an IValue first;
otherwise it holds it as a PyObjectType.

Test Plan:
Imported from OSS

https://fb.quip.com/aGxRAh2lCg05

Differential Revision: D19871243

Pulled By: wanchaol

fbshipit-source-id: ae5be3c52fb1e6db33c64e64ef64bc8b9ea63a9a
2020-02-27 22:49:53 -08:00
1507573a52 [caffe2] fix no return statement in constexpr function Clang error in TypeIndex.h (#33576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33576

A `throw` statement at the end of a `constexpr` function is ill-formed according to Clang. This happens when Clang is driving CUDA compilation and compiles the affected code for device. Due to its compilation model, it requires host code to be well-formed even when compiling for device.

Fix the error by guarding the entire definition of `type_index_impl` with a `__CUDA_ARCH__` check.
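
A hedged sketch of the guard; the body and hashing logic are placeholders, only the `__CUDA_ARCH__` structure mirrors the fix:

```cpp
#include <cstdint>
#include <stdexcept>

constexpr uint64_t type_index_impl(const char* name) {
#if defined(__CUDA_ARCH__)
  // Device compilation never evaluates this, but Clang still requires a
  // well-formed body that does not end in a bare throw.
  return 0;
#else
  // The host-only body may end in a throw for unparsable type names.
  if (name == nullptr || name[0] == '\0') {
    throw std::logic_error("type_index_impl: unparsable type name");
  }
  return static_cast<uint64_t>(name[0]);
#endif
}
```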

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Execute tests on devgpu:
```
buck test mode/dev-nosan -j 8 //caffe2/caffe2/python/operator_test/... //caffe2/test:cuda
```

Reviewed By: smessmer

Differential Revision: D20008881

fbshipit-source-id: b0dc9abf0dc308b8b8637b54646a0411baf7fef3
2020-02-27 22:29:58 -08:00
c18cb1eb52 Improve dll loading logic on Windows (#33856)
Summary:
The way it works on the Anaconda distribution of Python 3.8 is a bit different. Loading DLLs explicitly (e.g. `ctypes.CDLL`) relies on paths appended by `os.add_dll_directory`. But if you try to load DLLs implicitly (e.g. `from torch._C import *`), it will rely on `PATH`.

Fixes https://github.com/pytorch/vision/issues/1916.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33856

Differential Revision: D20150080

Pulled By: soumith

fbshipit-source-id: cdbe76c138ea259ef7414c6634d4f7e0b1871af3
2020-02-27 21:58:35 -08:00
cb8d9f99aa [JIT] Implement Tensor.tolist() (#33472)
Summary:
**Summary**
This commit adds an implementation of `Tensor.tolist()` to the JIT interpreter.

**Testing**
This commit adds several unit tests that test that this function works correctly for
0D, 1D, 2D and 3D tensors of type `float`, `int` and `bool`.

```
(base) meghanl-mbp:pytorch meghanl$ python test/test_jit.py TestList.test_to_list -v
Fail to import hypothesis in common_utils, tests are not derandomized
test_to_list (jit.test_list_dict.TestList)
Unit tests for Tensor.tolist() function. ... ok

----------------------------------------------------------------------
Ran 1 test in 0.329s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33472

Differential Revision: D20109738

Pulled By: SplitInfinity

fbshipit-source-id: a6e3fee5e3201d5e1f0c4ca45048488ae2bf5e33
2020-02-27 21:45:46 -08:00
5029ff001b [Revert] manual revert of D19918320 (#33920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33920

revert D19918320

Test Plan: revert diff

Reviewed By: zhaojuanmao

Differential Revision: D20151299

fbshipit-source-id: c346554ae9074991331479e434e54b0cc513f1a4
2020-02-27 21:22:36 -08:00
8f84deddd1 [jit] fix up refs in overview.md (#33919)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33919

Test Plan: Imported from OSS

Differential Revision: D20154953

Pulled By: suo

fbshipit-source-id: 2ef83cce8da88212bed7edc813c9b233267ea81b
2020-02-27 19:22:51 -08:00
d6485b411b [jit] add top-level readme to csrc/jit (#33916)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33916

Test Plan: Imported from OSS

Differential Revision: D20150771

Pulled By: suo

fbshipit-source-id: c7550954ddd6a294ce833348bf9fa058503e9bd7
2020-02-27 19:21:05 -08:00
bd7e9c490a [jit] stop printing crap in test_jit (#33917)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33917

Test Plan: Imported from OSS

Differential Revision: D20150750

Pulled By: suo

fbshipit-source-id: 9a35298a8856d423fb6b9043174853cccf968706
2020-02-27 19:06:43 -08:00
d66c320b10 disable leaky_relu_ backward calculation with negative slope (#33639)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33639

Test Plan: Imported from OSS

Differential Revision: D20045735

Pulled By: glaringlee

fbshipit-source-id: b3becf30a8fe9ee178792bd88f6ee10102504ed5
2020-02-27 18:54:57 -08:00
997b5b5797 [quant][graphmode][refactor] Simplify signature for insertObserverFor (#33274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33274

att

Test Plan:
.

Imported from OSS

Differential Revision: D20123588

fbshipit-source-id: e656d96e0b6004bfcca5df2ab222184d4e1dd6ad
2020-02-27 17:24:41 -08:00
db4a24e008 [jit] remove some unused/redundant files (#33806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33806

as title

Test Plan: Imported from OSS

Differential Revision: D20122117

Pulled By: suo

fbshipit-source-id: 209d29ed2c873181140c9fb5cdc305c200ce4008
2020-02-27 17:16:12 -08:00
877ab3afe3 Better handing of Autograd+Fork errors. (#33885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33885

Fixes: #32835
Fixes: #5834

This cannot be combined with CUDA's implementation, as each requires its own `std::once_flag` as well as a different `forked_autograd_child` function. The CUDA version defers to the Python module; autograd uses TORCH_CHECK to report the error to both Python and C++.
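
A hedged sketch of the mechanism (flag and function names are illustrative):

```cpp
#include <atomic>
#include <mutex>

#include <pthread.h>

#include <c10/util/Exception.h>

static std::atomic<bool> forked_after_autograd{false};

// Runs in the child process after fork(): mark autograd as unusable there.
static void forked_autograd_child() {
  forked_after_autograd = true;
}

void track_bad_autograd_forks() {
  // Each subsystem needs its own once_flag and child handler, which is why
  // the autograd and CUDA implementations stay separate.
  static std::once_flag flag;
  std::call_once(flag, [] {
    pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr,
                   /*child=*/&forked_autograd_child);
  });
}

void check_not_forked() {
  TORCH_CHECK(
      !forked_after_autograd,
      "Autograd was used in the parent process; using it after fork() "
      "is not supported.");
}
```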

Test Plan: Imported from OSS

Differential Revision: D20144024

Pulled By: VitalyFedyunin

fbshipit-source-id: e7cf30568fff5110e9df7fe5b23f18ed992fa17f
2020-02-27 16:07:29 -08:00
746e5218e7 Mistake in MSELoss documentation (#33836)
Summary:
Replaced `sum` with `mean` in [line 392](https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/loss.py#L392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33836

Differential Revision: D20142053

Pulled By: ailzhang

fbshipit-source-id: 2bfe19944ffc5534902dd9087023e70ddf5746c3
2020-02-27 15:34:46 -08:00
48fd410e44 Try fix XLAPreAutograd with *_like functions. (#33848)
Summary:
In *_like functions we call
`globalLegacyTypeDispatch().initForDispatchKeySet(c10::detail::multi_dispatch_key_set(self, options));` -> `dispatchKeyToBackend`, hence this change.
`self` has both `XLAPreAutograd` and `XLATensorId` in its key set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33848

Differential Revision: D20135898

Pulled By: ailzhang

fbshipit-source-id: a8585f39f3fa77b53718f20d3144f4f2f3cb8e53
2020-02-27 15:28:40 -08:00
87e97ced20 Split UnaryOpsKernel into smaller files for faster compilation. (#33888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33888

Test Plan: Imported from OSS

Differential Revision: D20143653

Pulled By: gchanan

fbshipit-source-id: de708030e93e96091e0c01a89b4342872d0657dd
2020-02-27 15:13:01 -08:00
aff1da5aac .circleci: Remove trailing slash, fix conda upload (#33903)
Summary:
Conda registers a trailing slash as a new user, so it was failing to
upload the Anaconda packages.

In the future this should be handled through a single variable that can
be used for both, but until then this will have to do.

Bug was introduced in https://github.com/pytorch/pytorch/issues/33842

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33903

Differential Revision: D20148679

Pulled By: seemethere

fbshipit-source-id: 27c95f5d906ce84aa34bf5d76fd6f1ef5df08fb9
2020-02-27 14:56:02 -08:00
a7fe200f5f [caffe2] simplify caffe2 code with fbgemm handling block size 1 emb (#33774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33774

Simplify caffe2 code using D19246900

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20102410

fbshipit-source-id: 8de4d9cfac66898db0718ac6477339fd5e5428e3
2020-02-27 14:45:28 -08:00
524dad13a8 Add device to the test tensor. Default device type is CPU, in pytorch… (#33635)
Summary:
…/xla this will result in a failure, since it compares an XLA tensor with a CPU tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33635

Differential Revision: D20043517

Pulled By: ailzhang

fbshipit-source-id: d84038ea675e4d4a9c02e7a8b0924bdb12f40501
2020-02-27 14:40:07 -08:00
edd5c009f7 fix docs mistakes in lr_scheduler.MultiplicativeLR (#33805)
Summary:
This PR addresses the issue [The docs of `MultiplicativeLR` use `LambdaLR` as example](https://github.com/pytorch/pytorch/issues/33752#issue-570374087).

https://github.com/pytorch/pytorch/issues/33752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33805

Differential Revision: D20121314

Pulled By: mruberry

fbshipit-source-id: 5afa63bbe83d35ce4e55705b8cbd96326a907651
2020-02-27 14:11:57 -08:00
d97560999b Split BinaryCompareKernel.cu into a file-per-kernel to speed up compilation. (#33871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33871

Test Plan: Imported from OSS

Differential Revision: D20140862

Pulled By: gchanan

fbshipit-source-id: a4fde38c1c7c5905e3855fa490ea2e87bb24c703
2020-02-27 13:48:36 -08:00
5eacdfb21f Revert D20127441: [pytorch][PR] [JIT] Introduce a fake Tensor creation node for IR unit tests
Test Plan: revert-hammer

Differential Revision:
D20127441

Original commit changeset: 56da4f23ac46

fbshipit-source-id: 7d4602e5011bec6f6871eab16af05a3198694e5d
2020-02-27 13:48:31 -08:00
c4d611a0f5 Split BinaryMiscOpsKernels into more files for faster build times. (#33873)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33873

Test Plan: Imported from OSS

Differential Revision: D20140974

Pulled By: gchanan

fbshipit-source-id: 88b982881e8034f3b03cdb6911ae4239d2bb1596
2020-02-27 13:47:06 -08:00
910acafc79 Revert D20124224: [jit] stop printing crap in test_jit
Test Plan: revert-hammer

Differential Revision:
D20124224

Original commit changeset: 9241d21fdf94

fbshipit-source-id: 0680f9db922f9a33a4e859eedd142b87a51bbede
2020-02-27 13:40:34 -08:00
53630f7681 Updating submodules
Summary:
GitHub commits:

ae68f84fcd
6cb0beaf0e
401fb54029
fe8777e593
44fcf005eb
72ee067b90
01a3c124d4
c94f8f43b9
a09b292a28
472e40a902
967d4bc051

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: e8e43b1cbc365fd7f5b068d625c4020240358690
2020-02-27 13:35:14 -08:00
243af17d65 Revert D20103905: [jit] Fix flipped PackedSequence outputs in script
Test Plan: revert-hammer

Differential Revision:
D20103905

Original commit changeset: 84081213ed21

fbshipit-source-id: 2b260654fac87e52fbaf8035018e4ea484928af1
2020-02-27 13:29:35 -08:00
a7cf5c859f Revert D20136865: fix lint
Test Plan: revert-hammer

Differential Revision:
D20136865

Original commit changeset: 4bf7ac324a0a

fbshipit-source-id: 94cc83cda180f744cec174d269f1b82babff0e5c
2020-02-27 13:21:44 -08:00
908eee5583 remove .data from test/distributed/ (#33874)
Summary:
`.data` calls are unsafe and should not be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33874

Differential Revision: D20141059

Pulled By: izdeby

fbshipit-source-id: 8e11afc74f0cb04f5b18b458068fb813a6d51708
2020-02-27 13:14:29 -08:00
390d4d6df3 [JIT] Introduce a fake Tensor creation node for IR unit tests (#33595)
Summary:
**Summary**
There is often a need to create a Tensor when writing IR by hand for JIT
optimisation pass unit tests. The only options for this today are real
Tensor creation functions like `aten::ones`. Any test that uses these functions
must also use the same default arguments as the Python/C++ API, which means
that all of the tests have to be updated when the API is updated. This commit
introduces a new primitive, `prim::MakeTestTensor` with schema `() -> Tensor` that
should be used in unit tests instead of real Tensor creation functions. This new
primitive has no public-facing API, so the maintenance burden is much lower.

**Testing**
This commit updates the alias analysis and DCE tests to use `prim::MakeTestTensor` instead of
`aten::rand`, `aten::ones`, and `aten::zeros`.

```
$ ./bin/test_jit
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = *-*_CUDA:*_MultiCUDA
[==========] Running 75 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 75 tests from JitTest
[ RUN      ] JitTest.ADFormulas
[       OK ] JitTest.ADFormulas (82 ms)
[ RUN      ] JitTest.Attributes
[       OK ] JitTest.Attributes (0 ms)
...
...
...
[ RUN      ] JitTest.LiteInterpreterPrim
[       OK ] JitTest.LiteInterpreterPrim (0 ms)
[ RUN      ] JitTest.LiteInterpreterLoadOrigJit
[       OK ] JitTest.LiteInterpreterLoadOrigJit (2 ms)
[----------] 75 tests from JitTest (150 ms total)

[----------] Global test environment tear-down
[==========] 75 tests from 1 test case ran. (150 ms total)
[  PASSED  ] 75 tests.
```

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33500.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33595

Differential Revision: D20127441

Pulled By: SplitInfinity

fbshipit-source-id: 56da4f23ac46335227254f606c6481718108f378
2020-02-27 13:10:20 -08:00
dbe850af5b [jit] do the code reorg (#33851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33851

Rationale and context described in #33828.

Script to reproduce the move:
https://gist.github.com/suo/16cbefaaeb67ca5a7c6caffd49b7f6e9
ghstack-source-id: 99079645

Test Plan: Make sure CI passes

Reviewed By: jamesr66a

Differential Revision: D20133869

fbshipit-source-id: 390e9241a9c85366d9005c492ac31f10aa96488e
2020-02-27 13:02:51 -08:00
afbd04449e [quant][graphmode] Swap dequantize after inline for ops that doesn't require observation (#33173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33173

How do we deal with ops that are defined for both floating-point and quantized Tensors?

Category of ops: the ones that don't require observers, which means the quantization parameters (scale/zero_point) of the op's output can be inferred from the quantization parameters of its inputs.
For example:
avg_pool, max_pool, flatten, transpose, upsample

A related question is how we deal with things like adaptive_avg_pool2d, which does not need to be observed and also works with quantized tensors. If we insert quant/dequant for them, even the quant fusion becomes a numerically changing operation, because the scale/zero_point of the input and output are different.

Proposal

We can swap the operator with dequantize whenever we see it. For example, take the following pattern, where aten::general_op is defined for both floating-point and quantized tensors:

%r = aten::conv(...)
%q = quantize(%r)
%dq = dequantize(%q)
%f = aten::general_op(%dq)
...

When we detect that all inputs of aten::general_op are produced by dequantize, we first delete all the dequantize nodes for the inputs and then insert a dequantize for each use of the output of aten::general_op. Note that this should work generally for all the cases we might encounter.

After transformation we’ll have:

%r = aten::conv(...)
%q = quantize(%r)
%x = aten::general_op(%q)
%f = dequantize(%x)
...

1. Multiple inputs
    1. We need to make sure all inputs of the aten::general_op are produced by dequantize before we do this transformation
2. Input used by multiple operators
    1. We already did this by inserting dequantize for each use of the value
3. Output used by multiple operators
    1. We'll reuse the code that inserts dequantize (might need some refactoring)

Note that concat does not currently belong to this category, since it does not inherit quantization parameters from its inputs.

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20123590

fbshipit-source-id: de2febe1f37e4079457a23acaeccbc6d9c9e1f8a
2020-02-27 12:42:29 -08:00
6647a44e8c Automatic update of fbcode/onnx to 9fdae4c68960a2d44cd1cc871c74a6a9d469fa1f (#33858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33858

Previous import was 04a29addfd5b912812addb8dea5f8763fbfaad01

Included changes:
- **[9fdae4c6](https://github.com/onnx/onnx/commit/9fdae4c6)**: Copy sizes in some optimizers to remain shape information (#2574) <daquexian>
- **[c978d102](https://github.com/onnx/onnx/commit/c978d102)**: Implement CELU node as a Function (#2575) <Jeremy Cochoy>
- **[c677aef4](https://github.com/onnx/onnx/commit/c677aef4)**: Fix CI build break (#2603) <Changming Sun>
- **[d343755d](https://github.com/onnx/onnx/commit/d343755d)**: Allow function body to rely on other operator sets (#2597) <Ke Zhang>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D20135343

fbshipit-source-id: d719c4ba2ae26892a5fa921691c84eba64b59291
2020-02-27 12:40:39 -08:00
bd77abffe3 Kill some unused (TH)Storage-based APIs. (#33815)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33815

Test Plan: Imported from OSS

Differential Revision: D20119333

Pulled By: gchanan

fbshipit-source-id: 15042ca0fabdc88b53d662b6dd964968f64997f4
2020-02-27 12:23:25 -08:00
b10761d890 fix type stub errors (#33762)
Summary:
I've been using PyTorch with type hints, and I found errors that can be easily fixed, so I'm creating this PR to fix the type bugs.

I expected the code below to type-check without any errors.

```python
import torch
from torch.nn import Linear
from torch.autograd import Variable
from torch.optim import AdamW
from torch.utils import hooks

# nn.Module should have training attribute
module = Linear(10, 20)
module.training

# torch should have dtype bfloat16
tensor2 = torch.tensor([1,2,3], dtype=torch.bfloat16)

# torch.Tensor.cuda should accept int or str value
torch.randn(5).cuda(1)
torch.tensor(5).cuda('cuda:0')

# optimizer should have default attribute
module = Linear(10, 20)
print(AdamW(module.weight).default)

# torch.Tensor should have these boolean attributes
torch.tensor([1]).is_sparse
torch.tensor([1]).is_quantized
torch.tensor([1]).is_mkldnn

# Size class should tuple of int
a, b = torch.tensor([[1,2,3]]).size()

# check modules can be accessed
torch.nn.parallel
torch.autograd.profiler
torch.multiprocessing
torch.sparse
torch.onnx
torch.jit
torch.hub
torch.random
torch.distributions
torch.quantization
torch.__config__
torch.__future__

torch.ops
torch.classes

# Variable class's constructor should return Tensor
def fn_to_test_variable(t: torch.Tensor):
    return None

v = Variable(torch.tensor(1))
fn_to_test_variable(v)

# check RemovableHandle attributes can be accessed
handle = hooks.RemovableHandle({})
handle.id
handle.next_id

# check torch function hints
torch.is_grad_enabled()
```

But the current master branch raises errors (I checked with pyright):

```
$ pyright test.py
Searching for source files
Found 1 source file
test.py
  12:45 - error: 'bfloat16' is not a known member of module
  15:21 - error: Argument of type 'Literal[1]' cannot be assigned to parameter 'device' of type 'Optional[device]'
  'int' is incompatible with 'device'
  Cannot assign to 'None'
  16:22 - error: Argument of type 'Literal['cuda:0']' cannot be assigned to parameter 'device' of type 'Optional[device]'
  'str' is incompatible with 'device'
  Cannot assign to 'None'
  23:19 - error: Cannot access member 'is_sparse' for type 'Tensor'
  Member 'is_sparse' is unknown
  24:19 - error: Cannot access member 'is_quantized' for type 'Tensor'
  Member 'is_quantized' is unknown
  25:19 - error: Cannot access member 'is_mkldnn' for type 'Tensor'
  Member 'is_mkldnn' is unknown
  32:7 - error: 'autograd' is not a known member of module
  33:7 - error: 'multiprocessing' is not a known member of module
  34:7 - error: 'sparse' is not a known member of module
  35:7 - error: 'onnx' is not a known member of module
  36:7 - error: 'jit' is not a known member of module
  37:7 - error: 'hub' is not a known member of module
  38:7 - error: 'random' is not a known member of module
  39:7 - error: 'distributions' is not a known member of module
  40:7 - error: 'quantization' is not a known member of module
  41:7 - error: '__config__' is not a known member of module
  42:7 - error: '__future__' is not a known member of module
  44:7 - error: 'ops' is not a known member of module
  45:7 - error: 'classes' is not a known member of module
  60:7 - error: 'is_grad_enabled' is not a known member of module
20 errors, 0 warnings
Completed in 1.436sec
```

And the list below is not flagged as errors, but I think these are errors too:

* `nn.Module.training` is not boolean
* return type of `torch.Tensor.size()` is `Tuple[Unknown]`.

 ---

Related issues:

https://github.com/pytorch/pytorch/issues/23731, https://github.com/pytorch/pytorch/issues/32824, https://github.com/pytorch/pytorch/issues/31753
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33762

Differential Revision: D20118884

Pulled By: albanD

fbshipit-source-id: 41557d66674a11b8e7503a48476d4cdd0f278eab
2020-02-27 06:58:53 -08:00
095de1e872 Migrate random_ from the TH to Aten (CPU and CUDA) (#33663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33663

Test Plan: Imported from OSS

Differential Revision: D20056350

Pulled By: pbelevich

fbshipit-source-id: f9859b79ffdec70c48d6ee3ec70fd6fad593a9f5
2020-02-27 05:05:42 -08:00
f5952cf7cb fix lint (#33861)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33861

Test Plan: Imported from OSS

Differential Revision: D20136865

Pulled By: suo

fbshipit-source-id: 4bf7ac324a0abce9b45121ac5ab438448a6f3149
2020-02-27 00:33:51 -08:00
9733711394 [JIT] Support calling Tensor.element_size() in TorchScript (#33808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33808

# Problem

https://github.com/pytorch/pytorch/issues/33620
ghstack-source-id: 99073701

Test Plan:
```
buck test mode/dev-nosan //caffe2/test:jit -- test_numel

buck test mode/dev-nosan //caffe2/test:jit -- test_element_size

buck build mode/dev-nosan //caffe2/test:jit \
&& buck-out/gen/caffe2/test/jit\#binary.par -r test_numel

buck build mode/dev-nosan //caffe2/test:jit \
&& buck-out/gen/caffe2/test/jit\#binary.par -r test_element_size
```

Compile error

P126667043

Generated code,
```
buck-out/dev/gen/caffe2/generate-code=register_aten_ops_0.cpp/register_aten_ops_0.cpp

buck-out/dev/gen/caffe2/generate-code=register_aten_ops_2.cpp/register_aten_ops_2.cpp
```
P126667064

Differential Revision: D7050644

fbshipit-source-id: 20dbdb9c500b6d7683c23e3049d43ed0ca06d831
2020-02-26 22:30:44 -08:00
00f685d2d8 Add Scalar::type() (#33603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33603

This function returns a ScalarType based on the Scalar's value. This is
helpful to avoid having code generated in aten_op.h determine the type of
returned Scalars from the `self` arg.
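
A short usage sketch of the accessor, assuming the stored payload determines the reported type:

```cpp
#include <ATen/ATen.h>

void scalar_type_example() {
  at::Scalar d = 1.5;   // floating-point payload
  at::Scalar i = 2;     // integral payload
  at::Scalar b = true;  // boolean payload

  // type() reports the ScalarType implied by the value itself; no tensor
  // argument (like `self` in aten_op.h) is needed to disambiguate.
  AT_ASSERT(d.type() == at::ScalarType::Double);
  AT_ASSERT(i.type() == at::ScalarType::Long);
  AT_ASSERT(b.type() == at::ScalarType::Bool);
}
```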

Test Plan: Imported from OSS

Differential Revision: D20100218

Pulled By: ezyang

fbshipit-source-id: 337729a7559e6abb3a16b2a563a2b92aa96c7016
2020-02-26 22:25:18 -08:00
d41c8d0461 Correctly preserve "not set anywhere" TensorOptions when merging. (#33510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33510

Previously, we would fill in TensorOptions with defaults whenever an
item was missing from both the left and right side of the merge.  This
is morally incorrect: if we don't have an item on the left or right,
we should keep the entry empty (so the downstream user can apply
the appropriate defaulting rule).

I don't think this caused any bugs, but I noticed this error when
working on a later patch in my diff stack.
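
A per-field sketch of the corrected merge rule, using c10::optional to stand in for TensorOptions' internal has-bits (an illustrative helper, not the actual implementation):

```cpp
#include <c10/core/ScalarType.h>
#include <c10/util/Optional.h>

// Prefer the explicitly-set side, fall back to the other, and crucially stay
// nullopt when neither side set the field, so the downstream consumer can
// still apply its own defaulting rule.
c10::optional<c10::ScalarType> merge_dtype(
    c10::optional<c10::ScalarType> lhs,
    c10::optional<c10::ScalarType> rhs) {
  if (lhs.has_value()) {
    return lhs;
  }
  return rhs;  // may legitimately remain nullopt
}
```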

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20001775

Pulled By: ezyang

fbshipit-source-id: 88139fc268b488cd1834043584a0d73f46c8ecaa
2020-02-26 21:46:39 -08:00
ca002a0f6b Switch empty_like to use merge_in to process TensorOptions. (#33505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33505

This shouldn't change semantics, but it has the benefit of making
torch::empty_like(x, dtype(kFloat)) actually work (previously, this
would just ignore all of the properties from x).
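
A small usage sketch of the behavior this enables in the C++ frontend:

```cpp
#include <torch/torch.h>

void empty_like_example() {
  torch::Tensor x = torch::randn({4, 4});

  // With merge_in, the explicit dtype is merged with the properties
  // inherited from x (device, layout) instead of replacing them wholesale.
  torch::Tensor y = torch::empty_like(x, torch::dtype(torch::kFloat));

  TORCH_CHECK(y.device() == x.device());
  TORCH_CHECK(y.scalar_type() == torch::kFloat);
}
```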

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20001776

Pulled By: ezyang

fbshipit-source-id: ba81186d3293abc65da6130b2684d42e9e675208
2020-02-26 21:44:33 -08:00
84101f353e Avoid problematic pickle usages on Python 3.8.0 and 3.8.1 (#33824)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32289

This has been fixed upstream as of Python 3.8.2. I think the easiest and least invasive way to ameliorate this is to catch the error condition and print a more informative error asking the user to update their Python version. It might be possible to buffer the data into memory and then read from memory, but that would be an invasive change and might cause memory exhaustion for very large models.

Suggestions for alternate fixes or ways to improve the error message wording are very welcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33824

Differential Revision: D20131722

Pulled By: ezyang

fbshipit-source-id: a6e3fbf4bf7f9dcce5772b36f7a622cbf14b5ae4
2020-02-26 21:15:38 -08:00
421e3e9a54 Release GIL for RPC pybind functions. (#33610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33610

Our pybind definitions for several RPC functions didn't release the GIL
while we were processing in C++.

This PR adds asserts that we release GIL appropriately and adds
py::gil_scoped_release and py::gil_scoped_acquire in the appropriate places.
ghstack-source-id: 99066749

Test Plan: waitforbuildbot

Differential Revision: D20025847

fbshipit-source-id: 57a778cba0336cf87352b07c89bbfb9254c4bdd7
2020-02-26 20:56:06 -08:00
cea0cc8ca8 [jit] Unify augmented assign handling (#33578)
Summary:
Stacked PRs
 * **#33578 - [jit] Unify augmented assign handling**
 * #32993 - [jit] Fix aug assign for non-tensor attributes

We handle augmented assignments to `Select` and `Var` statements differently, but the actual in-place update is the same for both, so this PR factors it out into a method to avoid having two code paths that do the same thing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33578

Pulled By: driazati

Differential Revision: D20127647

fbshipit-source-id: 94f37acbd2551498de9d2ca09a514508266f7d31
2020-02-26 19:13:15 -08:00
24dd800e6a [Dist Autograd] Functional API for Dist Autograd and Dist Optimizer (#33711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33711

Fixed #33480

This makes `dist_autograd.backward` and `dist_optimizer.step` functional by making the user explicitly pass in the `context_id` as opposed to relying on the confusing thread_local context_id.

This diff incorporates these API changes and all places where these functions are called.

More concretely, this code:

```
with dist_autograd.context():
    # Forward pass.
    dist_autograd.backward([loss.sum()])
    dist_optim.step()
```

should now be written as follows:

```
with dist_autograd.context() as context_id:
    # Forward pass.
    dist_autograd.backward(context_id, [loss.sum()])
    dist_optim.step(context_id)
```

Test Plan: Ensuring all existing dist_autograd and dist_optimizer tests pass with the new API. Also added a new test case for input checking.

Differential Revision: D20011710

fbshipit-source-id: 216e12207934a2a79c7223332b97c558d89d4d65
2020-02-26 19:08:28 -08:00
4c33222c51 [quant][graphmode] Replicate dequantize nodes (#33531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33531

We already insert a dequantize for each use of a value, but there might still be
cases where we only see that the value is used multiple times after inlining. This pass
adds support for replicating dequantize nodes after inlining, to ensure the output of each
dequantize is used by only one node. This is necessary to preserve
quantization patterns like `dequant - conv - quant`.
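
A hedged sketch of the core rewrite using the public Graph/Node cloning APIs (the helper name is illustrative):

```cpp
#include <torch/csrc/jit/ir/ir.h>

// Give every use of a dequantize output its own copy of the node, so each
// `dequant - op - quant` pattern stays intact for later fusion.
void replicateDequantize(torch::jit::Node* dequant) {
  torch::jit::Value* out = dequant->output();
  auto uses = out->uses();  // snapshot, since we rewrite inputs below
  for (size_t i = 1; i < uses.size(); ++i) {
    torch::jit::Node* copy = dequant->owningGraph()->createClone(
        dequant, [](torch::jit::Value* v) { return v; });
    copy->insertAfter(dequant);
    uses[i].user->replaceInput(uses[i].offset, copy->output());
  }
}
```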

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D20123591

fbshipit-source-id: 6edb10a4566538bcf9379d332233f870372b7a63
2020-02-26 18:59:16 -08:00
2b9fa4a756 [jit] Fix flipped PackedSequence outputs in script (#32955)
Summary:
Stacked PRs
 * **#32955 - [jit] Fix flipped PackedSequence outputs in script**
 * #32953 - [jit] Support properties on `Device`

Fixes #32605
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32955

Pulled By: driazati

Differential Revision: D20103905

fbshipit-source-id: 84081213ed214846e563b9f05bcab0210bb1a71b
2020-02-26 18:53:27 -08:00
150e025be8 [jit] stop printing crap in test_jit (#33779)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33779

This should eliminate random warnings and print spew from test_jit.

It also fixes a bug where we weren't properly comparing captured outputs
(!)

Test Plan: Imported from OSS

Differential Revision: D20124224

Pulled By: suo

fbshipit-source-id: 9241d21fdf9470531b0437427b28e325cdf08d3a
2020-02-26 18:46:03 -08:00
4dad00b64b [rpc] special case tensor type check when getting RRef (#33582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33582

Test Plan: Imported from OSS

Differential Revision: D20009837

Pulled By: wanchaol

fbshipit-source-id: 7e9ab87d4dddb822c7575891a2b620eff83bfa00
2020-02-26 18:44:40 -08:00
d494986171 [jit] make RRef type annotation available in Python (#33526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33526

Test Plan: Imported from OSS

Differential Revision: D19988848

Pulled By: wanchaol

fbshipit-source-id: aeebc946d08b38dac0b656617bf395e86bcea558
2020-02-26 18:44:35 -08:00
2448c97a53 [jit] infer RRef type as container type (#33369)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33369

This PR add RRef type infer rule when we try to infer a type from a
pyobject, this allow script module attributes could contain a rref,
(i.e. List[RRefs] as a module attribute)

Test Plan: Imported from OSS

Differential Revision: D19918320

Pulled By: wanchaol

fbshipit-source-id: e5fd99c0ba5693b22ed48f0c0550b5e1dac89990
2020-02-26 18:43:13 -08:00
857eb4145e [JIT] add support for torch.cdist (#33737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33737

Test Plan: Imported from OSS

Differential Revision: D20121916

Pulled By: eellison

fbshipit-source-id: b0427bbfd3ade1f3129c4a95a542fbc32c3abd76
2020-02-26 18:37:37 -08:00
f31b1d3453 [JIT] add support for lu_unpack (#33736)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33736

Test Plan: Imported from OSS

Differential Revision: D20121914

Pulled By: eellison

fbshipit-source-id: 1136f4d7678a2233129aefe3e30234af385b8353
2020-02-26 18:37:33 -08:00
4543cf4eb1 [JIT] add support for torch.lu to torchscript (#33724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33724

Fix for https://github.com/pytorch/pytorch/issues/33381, partial fix of https://github.com/pytorch/pytorch/issues/30786

Test Plan: Imported from OSS

Differential Revision: D20077321

Pulled By: eellison

fbshipit-source-id: a1e6a0370712b36c9f66979098ac2f9d500ca5f6
2020-02-26 18:37:28 -08:00
fddf73250d [JIT] fix resolving of functions in torch/functional. fix compilation of torch.stft (#33504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33504

Fix resolution of functions that are bound onto torch in torch/functional.py. This does not fix compilation of all of those functions; those will be done in follow-ups. torch.stft is handled as a start.

Fixes #21478

Test Plan: Imported from OSS

Differential Revision: D20014591

Pulled By: eellison

fbshipit-source-id: bb362f1b5479adbb890e72a54111ef716679d127
2020-02-26 18:35:43 -08:00
057fd5e10d add support for _modules, reducing special casing of nn.Sequential (#29495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29495

This PR adds support for `_modules`, making it so we no longer need to special-case support for `nn.Sequential`. I was getting internal errors around the previous approach using `self.define()`, so I am adding this PR as part of the stack.

Fix for https://github.com/pytorch/pytorch/issues/28998

Test Plan: Imported from OSS

Differential Revision: D18412561

Pulled By: eellison

fbshipit-source-id: a8b24ebee39638fccf63b2701f65f8bb0de84faa
2020-02-26 18:07:19 -08:00
6eef66e1f4 .circleci: Divert packages to test channel on tag (#33842)
Summary:
This sets up PIP_UPLOAD_FOLDER to point to the correct channel for
release candidates as opposed to nightlies.

Removes an old safety check that's not needed anymore for devtoolset3

And provides a nice default for PIP_UPLOAD_FOLDER, which should clear up
confusion on where it's initially set

This is a stepping stone towards the promotable pipeline.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33842

Differential Revision: D20130791

Pulled By: seemethere

fbshipit-source-id: dac94ef46299574c36c08c968dd36faddeae6363
2020-02-26 17:25:18 -08:00
cd0acf4374 port masked_fill from TH to ATen (#33330)
Summary:
Port `masked_fill` from TH to ATen with TensorIterator.

Single-core performance roughly stays the same; single-socket performance gets a **3~16x** boost.

`masked_fill` is missing from https://github.com/pytorch/pytorch/issues/24507
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33330

Differential Revision: D20098812

Pulled By: VitalyFedyunin

fbshipit-source-id: ff20712ffc00cc665550997abcfdfb8916c18e40
2020-02-26 17:20:07 -08:00
a0e90e1b45 ONNX Error Message on Missing Op (#33593)
Summary:
Print a complete and comprehensive error message describing the issue when an op is missing during ONNX export. Previously, an ambiguous "key not in registry" error was thrown, which did not help the user understand the failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33593

Reviewed By: hl475

Differential Revision: D20052213

Pulled By: houseroad

fbshipit-source-id: ae3010a97efdab26effad5e4a418e9cc41f5b04e
2020-02-26 15:18:16 -08:00
02908dfa67 remove setStorage with null StorageImpl support. (#33735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33735

This apparently used to create a new storage, but I couldn't find anywhere in the code where this actually happens.

Changing it to an assert to see what happens.

Test Plan: Imported from OSS

Differential Revision: D20084029

Pulled By: gchanan

fbshipit-source-id: e9c4db115a25fc2e17a3b166c1ff5a0e6b56d690
2020-02-26 15:12:41 -08:00
04f88a3a7b Add partition info message to NetDef (#33616)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33616

As titled. We start by assigning the `node_name` of the DeviceOption in each of the ops in the net. Then, for each unique node_name, we will have a PartitionInfo describing the partition, including the logical devices it can be assigned to; we establish the link by partition names.

Test Plan:
unittests

Canaries:
AF: https://our.intern.facebook.com/intern/ads/canary/424817103900710410
AI: https://our.intern.facebook.com/intern/ads/canary/424737510862189908

Reviewed By: ipiszy, bangshengtang, jfix71

Differential Revision: D20015493

fbshipit-source-id: 0bb0f30cfc3892f7b8709d87b8bc1fbab2f2c46d
2020-02-26 14:54:58 -08:00
51e405743f Revert D20010383: [jit] Unify augmented assign handling
Test Plan: revert-hammer

Differential Revision:
D20010383

Original commit changeset: 52e559ce907e

fbshipit-source-id: 7ca938070d5e98c91e7a7b8485a3c1e790c3ceb2
2020-02-26 14:22:14 -08:00
867990dc17 [jit] Unify augmented assign handling (#33578)
Summary:
Stacked PRs
 * **#33578 - [jit] Unify augmented assign handling**
 * #32993 - [jit] Fix aug assign for non-tensor attributes

We handle augmented assignments to `Select` and `Var` statements differently, but the actual in-place update is the same for both, so this PR factors it out into a method so we don't have two code paths doing the same thing.
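
For reference, a minimal sketch of the two assignment forms (illustrative only, not code from the PR):

```python
import torch

# Augmented assignment to a local variable (a "Var" target).
@torch.jit.script
def var_form(x: torch.Tensor) -> torch.Tensor:
    y = torch.zeros_like(x)
    y += x              # compiles to an in-place update
    return y

# Augmented assignment to an attribute (a "Select" target).
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.buf = torch.zeros(3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.buf += x   # same in-place update, different target kind
        return self.buf

m = torch.jit.script(M())
```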
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33578

Pulled By: driazati

Differential Revision: D20010383

fbshipit-source-id: 52e559ce907e95e5c169ab9d9690d0d235db36f3
2020-02-26 14:09:40 -08:00
c32fa465a5 Preserve Backward compatibility of models serialized before #31040 (#33796)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33796

Test Plan: Imported from OSS

Differential Revision: D20109662

Pulled By: jerryzh168

fbshipit-source-id: 9bc936a59fd6dd1031fbf05eb90f98ae9677b936
2020-02-26 13:40:38 -08:00
5c33d98b0d Add assert_tensor_equal and assert_tensor_not_equal to test/cpp/api/support.h (#30426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30426

This PR adds `assert_tensor_equal` and `assert_tensor_not_equal` to `test/cpp/api/support.h`, as better functions for testing whether two tensors are equal / not equal.

Test Plan: Imported from OSS

Differential Revision: D18695900

Pulled By: yf225

fbshipit-source-id: c19b9bc4c4e84d9f444015023649d27618fcbdf5
2020-02-26 13:25:25 -08:00
8aa09de19e build: set -DNDEBUG in Release (#32719)
Summary:
This might lead to silent undefined behaviour (e.g. with out-of-bounds indices). This affects `test_multinomial_invalid_probs_cuda`, which is now removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32719

Test Plan:
* Build with VERBOSE=1 and manually inspect `less ndebug.build.log | grep 'c++' | grep -v -- -DNDEBUG` (only with nina on Linux)
* CI

Fixes https://github.com/pytorch/pytorch/issues/22745

Differential Revision: D20104340

Pulled By: yf225

fbshipit-source-id: 2ebfd7ddae632258a36316999eeb5c968fb7642c
2020-02-26 12:53:31 -08:00
93e30c16cb .circleci: Switch to using robot token for conda uploads (#33786)
Summary:
Thanks to pjh5 for the continued use of his account to upload binaries, but I
think we can start using a bot account for this now.

Just a draft until we can ensure the env variables get injected correctly and the token can actually upload.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33786

Differential Revision: D20122423

Pulled By: seemethere

fbshipit-source-id: 0444584831a40ae730325d258935f6d1b873961b
2020-02-26 11:37:40 -08:00
45e4b614d1 Per channel quantization performance improvement (#33772)
Summary:
Benchmark:
NVIDIA GTX 1650 + AMD Ryzen Threadripper 3970X
```python
import torch
print(torch.__version__)

# warm up the CUDA context before timing
for i in range(1000):
    torch.randn(1024 * 128, device='cuda')

def cuda(e):
    a = torch.randn(2 ** e, 32, device='cuda')
    s = torch.randn(32, device='cuda')
    z = torch.randn(32, device='cuda')
    torch.cuda.synchronize()
    %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999); torch.cuda.synchronize()

def cpu(e):
    a = torch.randn(2 ** e, 32, device='cpu')
    s = torch.randn(32, device='cpu')
    z = torch.randn(32, device='cpu')
    %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999);

for i in range(10, 24):
    cuda(i)
print()
for i in range(10, 32):
    cpu(i)
```
Before
```
1.5.0a0+9bc922d
849 µs ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
817 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
814 µs ± 2.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.11 ms ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.19 ms ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.14 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.41 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
26.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.6 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
207 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

249 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
420 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
766 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.45 ms ± 574 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.84 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.69 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.29 ms ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.32 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.4 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
47.5 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
187 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
379 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
652 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.22 s ± 4.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.34 s ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.56 s ± 7.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.97 s ± 33.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.8 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
35.2 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
After
```
1.5.0a0+a7ec8cc
92.5 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
97.7 µs ± 469 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 6.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
146 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
211 µs ± 2.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
347 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
624 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.17 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.25 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.43 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.51 ms ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.9 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
33.7 ms ± 7.64 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

201 µs ± 234 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
285 µs ± 465 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 761 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
347 µs ± 399 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
675 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.34 ms ± 643 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.82 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.7 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
20.3 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.4 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
78.8 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
153 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
285 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
541 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.03 s ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.97 s ± 8.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.81 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Fixes https://github.com/pytorch/pytorch/issues/33647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33772

Differential Revision: D20112531

Pulled By: ngimel

fbshipit-source-id: f90e3ef1b5be8276851637f3e1251cb8f1af411f
2020-02-26 10:19:25 -08:00
f597ac6efc Fix grid_sample gradients at image borders (#32829)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23925

This fixes the incorrect gradients returned by `F.grid_sample` at image borders under `"border"` and `"reflection"` padding modes.

At nondifferentiable points, the choice of which gradient to return among its super- or subgradients is rather arbitrary and generally does not affect training. Before this change, however, a bug in the code meant that the gradient returned at the exact borders was not selected from among the super- or subgradients.

The gradient is now set to zero at the borders, which is a defensible choice for both the `"border"` and `"reflection"` padding modes (see the sketch after this list):
* For `"border"` padding, this effectively means that the exact borders of the image are now considered out of bounds, and therefore receive zero gradient.
* For `"reflection"` padding, this effectively treats the exact borders as extrema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32829

Differential Revision: D20118564

Pulled By: soumith

fbshipit-source-id: ef8571ff585be35ab1b90a922af299f53ab9c095
2020-02-26 10:10:42 -08:00
b8f0acf50f Fix examples with updated pruning naming convention (#33144)
Summary:
Fix in docs requested by vainaijr.
Closes issue https://github.com/pytorch/pytorch/issues/32991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33144

Differential Revision: D20104640

Pulled By: albanD

fbshipit-source-id: 9b1be2c1cbde1964967967a9581bb6932a305d81
2020-02-26 10:02:50 -08:00
a8e7ed48f4 [pt][quant] Parallelize quantize and dequantize (#33765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33765

quantize and dequantize methods now use multiple threads, taking advantage of shz0116's recent parallelization of the quantize/dequantize routines in FBGEMM.

Fixes:
https://github.com/pytorch/pytorch/issues/32006
https://github.com/pytorch/FBGEMM/issues/142

Alternative to https://github.com/pytorch/pytorch/pull/30153

```
#!/usr/bin/env python

import time
import torch
import torch.nn as nn
torch.set_num_threads(4)
# print(torch.__config__.parallel_info())

W = torch.rand(1, 54, 54, 256)

NITER = 1000
s = time.time()
for i in range(NITER):
    W_q = torch.quantize_per_tensor(W, scale=1.0, zero_point = 0, dtype=torch.quint8)
time_per_iter = (time.time() - s) / NITER

print('quantize time per iter ms', time_per_iter * 1000)

s = time.time()
for i in range(NITER):
    W_deq = W_q.dequantize()
time_per_iter = (time.time() - s) / NITER

print('dequantize time per iter ms', time_per_iter * 1000)
```

### With 1 thread
quantize time per iter ms 0.22633790969848633
dequantize time per iter ms 0.6573665142059326

### With 4 threads
quantize time per iter ms 0.0905618667602539
dequantize time per iter ms 0.19511842727661133
ghstack-source-id: 98935895

Test Plan: python test/test_quantized.py

Reviewed By: jspark1105

Differential Revision: D20098521

fbshipit-source-id: bd8c45761b4651fcd5b20b95759e3868a136c048
2020-02-26 10:00:40 -08:00
2eb95d8f4a Migrate fmod and fmod_ from TH to ATen (CPU) (#33592)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33592

Differential Revision: D20043875

Pulled By: ezyang

fbshipit-source-id: b8c0a4e73a3cef6e55e91bbd35f8aadca8114c56
2020-02-26 09:35:16 -08:00
f87b0b2515 Remove the use of macros in defining binary ops for base Vec256 (#33733)
Summary:
This greatly improves readability and maintainability (e.g., debugging)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33733

Differential Revision: D20103187

Pulled By: ezyang

fbshipit-source-id: e539e46f5d378a2b01da7ecaa6b850655e0fa866
2020-02-26 09:21:35 -08:00
c1dd70688a Fix deprecated python "add" calls (#33428)
Summary:
This PR fixes those Python "add" calls that use the deprecated signature `add(Scalar, Tensor)`. The alternative signature `add(Tensor, alpha = Scalar)` is used instead.
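
For illustration, a minimal sketch of the two signatures (not code from the PR):

```python
import torch

t = torch.ones(3)
u = torch.arange(3.)

# deprecated: t.add(2, u)      # the add(Scalar, Tensor) form
out = t.add(u, alpha=2)        # the add(Tensor, alpha=Scalar) form: t + 2 * u
print(out)                     # tensor([1., 3., 5.])
```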

cc csarofeen zasdfgbnm ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33428

Differential Revision: D20002534

Pulled By: vincentqb

fbshipit-source-id: 81f2dd6170a47a9b53a17e5817c26e70d8afa130
2020-02-26 09:02:31 -08:00
24659d28a1 Feature/vonmises upstream (#33418)
Summary:
Third try of https://github.com/pytorch/pytorch/issues/33177 😄
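
For reference, a minimal usage sketch of the distribution this adds (assuming it lands as `torch.distributions.VonMises`):

```python
import torch
from torch.distributions import VonMises

# loc is the mean direction in radians; concentration (kappa) acts like an
# inverse variance for this circular distribution.
d = VonMises(loc=torch.tensor(0.0), concentration=torch.tensor(4.0))
samples = d.sample((5,))   # angles
logp = d.log_prob(samples)
```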
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33418

Differential Revision: D20069683

Pulled By: ezyang

fbshipit-source-id: f58e45e91b672bfde2e41a4480215ba4c613f9de
2020-02-26 08:19:12 -08:00
758ad516f3 [Lite interpreter] Pass shared_ptr properly (#33667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33667

Pass shared_ptr properly according to C++ guidelines. Thanks to kimishpatel for pointing it out.

Test Plan: Imported from OSS

Differential Revision: D20111001

Pulled By: iseeyuan

fbshipit-source-id: 213a0f950a7f3b9199d789dc0155911f6102d77a
2020-02-25 21:40:05 -08:00
fc6a153688 [WIP] Reanimate gradient scaling API with original scale update heuristic (#33366)
Summary:
Also, the Windows memory failures responsible for the earlier reversion have been fixed.

This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
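
A minimal usage sketch of the reanimated API (assuming it is exposed as `torch.cuda.amp.GradScaler`; requires a CUDA device):

```python
import torch

model = torch.nn.Linear(10, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    opt.zero_grad()
    loss = model(torch.randn(4, 10, device="cuda")).sum()
    scaler.scale(loss).backward()  # scale the loss to avoid gradient underflow
    scaler.step(opt)               # unscales grads; skips the step on inf/nan
    scaler.update()                # adjusts the scale per the update heuristic
```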
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366

Differential Revision: D20099026

Pulled By: ngimel

fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
2020-02-25 19:00:34 -08:00
a836c4ca78 Skip manual backward for cdist with case p=2 (#31167)
Summary:
Fixes an issue with the `cdist` backward calculation for large inputs in the euclidean case.

The grid size when launching the kernel exceeded the 2^16 limit for the second dimension, resulting in `RuntimeError: CUDA error: invalid configuration argument`

Code to reproduce:

```
h, w, d = 800, 1216, 12
n = 133
A = torch.randn(n, d).cuda()
B = torch.randn(h, w, d).cuda()
A.requires_grad = True
B.requires_grad = True

B = B.reshape(-1, d).contiguous()
dist = torch.cdist(A, B)
loss = dist.sum()
loss.backward()
```

Thanks to tkerola for the bug report, reproduction and suggesting a solution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31167

Differential Revision: D20035605

Pulled By: ngimel

fbshipit-source-id: ae28ba4b549ee07a8bd937bb1de2438dc24eaa17
2020-02-25 18:19:30 -08:00
9a5ea71380 pad_packed_sequence: doc improvement (#33768)
Summary:
pad_packed_sequence:
1. clarify that the batch's order is restored to the original one
2. add an example

This is a follow up to https://github.com/pytorch/pytorch/issues/33746
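
For illustration, a minimal sketch of the clarified behavior (not the example added in the PR):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

seqs = torch.tensor([[1, 2, 0], [3, 4, 5]])  # batch of 2, padded with 0
lengths = torch.tensor([2, 3])
# enforce_sorted=False lets packing sort the batch by length internally...
packed = pack_padded_sequence(seqs, lengths, batch_first=True,
                              enforce_sorted=False)
# ...and pad_packed_sequence restores the original batch order.
padded, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(padded)       # rows come back in the original order
print(out_lengths)  # tensor([2, 3])
```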
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33768

Differential Revision: D20102792

Pulled By: ngimel

fbshipit-source-id: 5ef511e5e3833edcb85cc01af0e92568b6d7a3cf
2020-02-25 18:00:04 -08:00
5bac7febad removed padding and dilation from LPPool2d Doc (#33714)
Summary:
Removed padding and dilation from the LPPool2d doc, as the function does not support padding and dilation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33714

Differential Revision: D20097021

Pulled By: ngimel

fbshipit-source-id: fc1c2d918b32f4b45c7e6e6bd93f018e867a628f
2020-02-25 17:54:38 -08:00
038ee01393 Disable printing of the histogram when dump (#33749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33749

Disable printing of the histogram on dump to make the log cleaner.

Test Plan: CI

Reviewed By: amylittleyang

Differential Revision: D20087735

fbshipit-source-id: 5421cd9d25c340d92f29ce63fed2a58aefef567d
2020-02-25 17:37:55 -08:00
8667379133 [quant][graphmode][refactor] Factor out insertDequantCall (#33172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33172

For code reuse

Test Plan:
.

Imported from OSS

Differential Revision: D20087842

fbshipit-source-id: 797868d31b96c4ff8640121ea4bee1396deb6b57
2020-02-25 17:22:35 -08:00
a13ee18982 [quant][graphmode] refactor nodeQuantizable (#33171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33171

For better code reuse

Test Plan:
.

Imported from OSS

Differential Revision: D20087845

fbshipit-source-id: f88cffb410bd54a1b3f937786104f46bcd1190d3
2020-02-25 15:20:22 -08:00
8159316714 Revert D19941103: [pytorch] blas gemm fix for k=0
Test Plan: revert-hammer

Differential Revision:
D19941103

Original commit changeset: e1c85d1e7574

fbshipit-source-id: da12747130c60b61452aa46e269c66546a1075f9
2020-02-25 13:30:38 -08:00
4d203c6fc8 Move cumprod and cumsum to Aten(CPU) (#33280)
Summary:
This PR is about move cumprod and cumsum to Aten.
Test script:
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    return time.time()

device = "cpu"

#torch.set_num_threads(1)

#warm up
for n in [10, 300]:
    input = torch.randn(n, n, n, requires_grad=False, device=device)
    input = input * 0.01 + 1
    for dim in range(input.dim()):
        for i in range(100):
            #output = input.cumsum(dim)
            output = input.cumprod(dim)

for n in [10, 300]:
    input = torch.randn(n, n, n, requires_grad=False, device=device)
    input = input * 0.01 + 1
    for dim in range(input.dim()):
        fwd_t = 0
        for i in range(1000):
            t1 = _time()
            #output = input.cumsum(dim)
            output = input.cumprod(dim)
            t2 = _time()
            fwd_t = fwd_t + (t2 -t1)
        fwd_avg = fwd_t / 1000 * 1000
        print("size = (%d, %d, %d); reduce dim=%d; compute time is %.4f(ms)" % (n, n, n, dim, fwd_avg))
```
Test device: **skx-8180**.
Performance:
```
cumsum:
Before:
size = (10, 10, 10); reduce dim=0; compute time is 0.0098(ms)
size = (10, 10, 10); reduce dim=1; compute time is 0.0089(ms)
size = (10, 10, 10); reduce dim=2; compute time is 0.0089(ms)
size = (300, 300, 300); reduce dim=0; compute time is 208.9403(ms)
size = (300, 300, 300); reduce dim=1; compute time is 241.5989(ms)
size = (300, 300, 300); reduce dim=2; compute time is 66.2587(ms)
After:
size = (10, 10, 10); reduce dim=0; compute time is 0.0065(ms)
size = (10, 10, 10); reduce dim=1; compute time is 0.0063(ms)
size = (10, 10, 10); reduce dim=2; compute time is 0.0053(ms)
size = (300, 300, 300); reduce dim=0; compute time is 36.0139(ms)
size = (300, 300, 300); reduce dim=1; compute time is 36.0776(ms)
size = (300, 300, 300); reduce dim=2; compute time is 21.0111(ms)
number_threads = 1:
size = (10, 10, 10); reduce dim=0; compute time is 0.0053(ms)
size = (10, 10, 10); reduce dim=1; compute time is 0.0052(ms)
size = (10, 10, 10); reduce dim=2; compute time is 0.0051(ms)
size = (300, 300, 300); reduce dim=0; compute time is 81.8831(ms)
size = (300, 300, 300); reduce dim=1; compute time is 88.5687(ms)
size = (300, 300, 300); reduce dim=2; compute time is 54.9922(ms)

cumprod:
Before:
size = (10, 10, 10); reduce dim=0; compute time is 0.0096(ms)
size = (10, 10, 10); reduce dim=1; compute time is 0.0088(ms)
size = (10, 10, 10); reduce dim=2; compute time is 0.0088(ms)
size = (300, 300, 300); reduce dim=0; compute time is 221.2601(ms)
size = (300, 300, 300); reduce dim=1; compute time is 249.7894(ms)
size = (300, 300, 300); reduce dim=2; compute time is 71.5182(ms)
number_threads = 1:
size = (10, 10, 10); reduce dim=0; compute time is 0.0100(ms)
size = (10, 10, 10); reduce dim=1; compute time is 0.0093(ms)
size = (10, 10, 10); reduce dim=2; compute time is 0.0093(ms)
size = (300, 300, 300); reduce dim=0; compute time is 207.6287(ms)
size = (300, 300, 300); reduce dim=1; compute time is 241.6693(ms)
size = (300, 300, 300); reduce dim=2; compute time is 66.2977(ms)
After:
size = (10, 10, 10); reduce dim=0; compute time is 0.0063(ms)
size = (10, 10, 10); reduce dim=1; compute time is 0.0062(ms)
size = (10, 10, 10); reduce dim=2; compute time is 0.0053(ms)
size = (300, 300, 300); reduce dim=0; compute time is 36.4283(ms)
size = (300, 300, 300); reduce dim=1; compute time is 38.1139(ms)
size = (300, 300, 300); reduce dim=2; compute time is 20.9140(ms)
number_threads =1:
size = (10, 10, 10); reduce dim=0; compute time is 0.0052(ms)
size = (10, 10, 10); reduce dim=1; compute time is 0.0052(ms)
size = (10, 10, 10); reduce dim=2; compute time is 0.0050(ms)
size = (300, 300, 300); reduce dim=0; compute time is 82.6926(ms)
size = (300, 300, 300); reduce dim=1; compute time is 90.1265(ms)
size = (300, 300, 300); reduce dim=2; compute time is 55.0196(ms)
```
Fix https://github.com/pytorch/pytorch/issues/24668, https://github.com/pytorch/pytorch/issues/24669.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33280

Differential Revision: D20076997

Pulled By: VitalyFedyunin

fbshipit-source-id: 12225767da8cfdc5e44257462a432bffa04cd469
2020-02-25 13:03:16 -08:00
0dded4026e [C++ API] Add PackedSequence / pack_padded_sequence / pad_packed_sequence / pack_sequence (#33652)
Summary:
Most of the function implementation and test code are translated from the Python version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33652

Differential Revision: D20052211

Pulled By: yf225

fbshipit-source-id: ce6767db54364f91ef4f06674239a12278c2752a
2020-02-25 12:53:41 -08:00
c20628c5f6 Remove clean_tag from tensorboard (#33133)
Summary:
The function originally comes from 4279f99847/tensorflow/python/ops/summary_op_util.py (L45-L68)

As its comment says:
```
    # In the past, the first argument to summary ops was a tag, which allowed
    # arbitrary characters. Now we are changing the first argument to be the node
    # name. This has a number of advantages (users of summary ops now can
    # take advantage of the tf name scope system) but risks breaking existing
    # usage, because a much smaller set of characters are allowed in node names.
    # This function replaces all illegal characters with _s, and logs a warning.
    # It also strips leading slashes from the name.
```

This function is only for compatibility with TF's operator name restrictions, and is therefore no longer valid in pytorch. By removing it, tensorboard summaries can use more characters in the names.

Before:
![0209-12:10:14](https://user-images.githubusercontent.com/1381301/74109072-37382e00-4b35-11ea-8c9f-ab37a8bd5808.png)

After:
![0209-12:10:57](https://user-images.githubusercontent.com/1381301/74109081-4323f000-4b35-11ea-9dab-447f8466a41e.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33133

Differential Revision: D20089307

Pulled By: ezyang

fbshipit-source-id: 3552646dce1d5fa0bde7470f32d5376e67ec31c6
2020-02-25 12:41:58 -08:00
72288e82e2 Use shim executable sccache-cl as the compiler instead of sccache cl (#33745)
Summary:
CMake only treats the first item of `CC` and `CXX` as the executable, so calling `sccache.exe` directly won't work. Using a shim executable resolves this problem.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33745

Differential Revision: D20100397

Pulled By: soumith

fbshipit-source-id: 3a130d30dd548b7c2e726c064e66ae4fccb30c44
2020-02-25 12:24:05 -08:00
0e74cbcc54 Revert "Revert "Revert D19975411: Remove special case codegen for tril_indices/triu_indices." (#33572)" (#33742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33742

This reverts commit 90f4c5695e1785883d9ae7c86ad3fabd1963a4cb.

Test Plan: Imported from OSS

Differential Revision: D20095103

Pulled By: ezyang

fbshipit-source-id: ff47dae21c278570b4ca497d76deedb75823d6d7
2020-02-25 12:09:49 -08:00
9bc922d518 Extend cuda install timeout for Windows jobs (#33755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33755

Differential Revision: D20100372

Pulled By: soumith

fbshipit-source-id: 8b39177d3e87d248857f0582de6c9e203d09d4a7
2020-02-25 11:51:43 -08:00
7eba36b1f6 [quant][graphmode][refactor] Separate preprocess step for insertObserver (#32813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32813

We need to separate this step to make the logic clearer, and also to find all
the values we want to skip in advance, without the interference of inserted
observers.

Test Plan:
.

Imported from OSS

Differential Revision: D20087841

fbshipit-source-id: ec3654ca561c0d4e2c05011988bb9ecc8671c5c2
2020-02-25 11:26:22 -08:00
d82093e665 [profiler] remove redundant assert in record_function_ops (#33225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33225

This removes a redundant assert statement in `record_function_ops`. In
the else branch in question, we are guaranteed to have `current == &rec`, so
this assert will never fire.

Although, maybe we should add an assert failure when `current == &rec` since it
seems that `current` should always be profiler::record_function_exit.
ghstack-source-id: 98852219

Test Plan: Existing autograd profiler UTs past

Differential Revision: D19849145

fbshipit-source-id: 2014a0d3b9d11e5b64942a54e0fb45e21f46cfa2
2020-02-25 10:59:10 -08:00
2b404de347 [scripts] Add script to fetch clang-format binary from AWS S3 (#33644)
Summary:
**Summary**
This commit adds a script that fetches a platform-appropriate `clang-format` binary
from S3 for use during PyTorch development. The goal is for everyone to use the exact
same `clang-format` binary so that there are no formatting conflicts.

**Testing**
Ran the script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33644

Differential Revision: D20076598

Pulled By: SplitInfinity

fbshipit-source-id: cd837076fd30e9c7a8280665c0d652a33b559047
2020-02-25 10:47:03 -08:00
98526c7444 Migrate fake_quant_slice to TensorIterator (#33744)
Summary:
This is a quick improvement for per-tensor quantization.

For per-channel, the loop in https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/fake_quant_per_channel_affine.cpp should be removed as well.

# Benchmark:
device = GTX-1650
```python
import torch
print(torch.__version__)

# warm up the CUDA context before timing
for i in range(1000):
    torch.randn(1024 * 128, device='cuda')

def f(e):
    a = torch.randn(2 ** e, device='cuda')
    torch.cuda.synchronize()
    %timeit torch.fake_quantize_per_tensor_affine(a, 0.5, 0, 0, 1); torch.cuda.synchronize()

for i in range(15, 27):
    f(i)
```
Before
```
1.5.0a0+bf00b4d
14.5 µs ± 981 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.2 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
25.6 µs ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
38.6 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
70.2 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
125 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
231 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
461 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
891 µs ± 88.2 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.77 ms ± 8.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.77 ms ± 80.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After
```
1.5.0a0+3f18ac3
12.5 µs ± 738 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
13.7 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
17.9 µs ± 850 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
29.7 µs ± 285 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.4 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
95 µs ± 8.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
173 µs ± 7.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
348 µs ± 29.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
657 µs ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.33 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.71 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.33 ms ± 439 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33744

Differential Revision: D20090129

Pulled By: ngimel

fbshipit-source-id: 5dd48a0c5455a2b6c5c638d747c1767cb259255d
2020-02-25 10:44:21 -08:00
8196ec0115 Remove some dead THStorage related code. (#33734)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33734

Test Plan: Imported from OSS

Differential Revision: D20084030

Pulled By: gchanan

fbshipit-source-id: 29aa5459e8ecc8af8af31157797f44057d6a786e
2020-02-25 09:44:05 -08:00
5ef1c2c5d2 Back out "[pt][quant] RNN debug test" (#33750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33750

Original commit changeset: 8c38d8f067e5
ghstack-source-id: 98911215

Test Plan: CI

Differential Revision: D20090521

fbshipit-source-id: 73df43ad60574e44e80b36ebf6392030c3efb66e
2020-02-25 09:28:00 -08:00
ee23944f46 [Caffe2] Fix shape inference for element-wise operators (#33431)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33431

Some elementwise operators don't have shape and type inference specified for the output tensor: `BitwiseOr`, `BitwiseAnd`, `BitwiseXor`, `Not`, `Sign`.

This change fixes this issue:
- For `Not` and `Sign` operators, the output has the same type and shape as the input, so `IdenticalTypeAndShapeOfInput` function is used to specify that.
- For bitwise operators created by `CAFFE2_SCHEMA_FOR_BINARY_BITWISE_OP` macro, the type and shape inference rules should be the same as for other binary element-wise operators, so `TensorInferenceFunction(ElementwiseOpShapeInference)` is used to specify that.

Also some tests were modified to ensure that the shape and type are inferred (`ensure_outputs_are_inferred` parameter)

Test Plan:
```
CAFFE2_ASSERT_SHAPEINFERENCE=1 buck test caffe2/caffe2/python/operator_test:elementwise_ops_test
CAFFE2_ASSERT_SHAPEINFERENCE=1 buck test caffe2/caffe2/python/operator_test:math_ops_test
```

Note that the tests have to be executed with `CAFFE2_ASSERT_SHAPEINFERENCE=1` in order to fail upon shape inference failure.

Reviewed By: idning

Differential Revision: D19880164

fbshipit-source-id: 5d7902e045d79e5669e5e98dfb13a39711294939
2020-02-25 09:03:06 -08:00
819ca2c285 add bfloat16 conversion method in type stub (__init__.pyi) (#33747)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/33699

`torch/__init__.pyi` will be generated like

```python
# TODO: One downside of doing it this way, is direct use of
# torch.tensor.Tensor doesn't get type annotations.  Nobody
# should really do that, so maybe this is not so bad.
class Tensor:
    requires_grad: _bool = ...
    grad: Optional[Tensor] = ...

    # some methods here...

    @overload
    def bernoulli_(self, p: _float=0.5, *, generator: Generator=None) -> Tensor: ...
    def bfloat16(self) -> Tensor: ...
    def bincount(self, weights: Optional[Tensor]=None, minlength: _int=0) -> Tensor: ...

    # some methods here...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33747

Differential Revision: D20090316

Pulled By: ngimel

fbshipit-source-id: b9ce4c0d4ef720c94ccac0a0342a012e8cf3af0c
2020-02-25 08:49:47 -08:00
fd175fa8a2 fix bugs in gen_pyi.py (#33748)
Summary:
This loop should generate type hints for in-place binary operator methods (the `binop` variable) but had been using the `name` variable. That's why the wrong type hints were being generated.

Resolve https://github.com/pytorch/pytorch/issues/33698

 ---

Current `__init__.pyi` has these type hints.

```python
class Tensor:

    # some code here...

    @overload
    def zeros_like_(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like_(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like_(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def zeros_like_(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def zeros_like__(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like__(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like__(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def zeros_like__(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def zeros_like___(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like___(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like___(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def zeros_like___(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def zeros_like____(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like____(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def zeros_like____(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def zeros_like____(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...

    # some code here...
```

But `__init__.pyi` should generate these type hints.

```python
class Tensor:

    # some code here...

    @overload
    def add_(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def add_(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def add_(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def add_(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...

    # some code here...

    @overload
    def div_(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def div_(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def div_(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def div_(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...

    # some code here...

    @overload
    def mul_(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def mul_(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def mul_(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def mul_(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...

    # some code here...

    @overload
    def sub_(self, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def sub_(self, value: Number, other: Union[Tensor, Number]) -> Tensor: ...
    @overload
    def sub_(self, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...
    @overload
    def sub_(self, value: Number, other: Union[Tensor, Number], *, out: Optional[Tensor]=None) -> Tensor: ...

    # some code here...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33748

Differential Revision: D20090444

Pulled By: ngimel

fbshipit-source-id: e4a5dd08126629ec4c54b630a87ee540e669ec9a
2020-02-25 08:45:19 -08:00
6bdb59539f follow-up test_torch .data removal (#33696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33696

This changes two tests:
- The batchnorm inference test cannot change the memory format of the weights, as they are 1D, so this is removed.
- The batchnorm test now runs in both affine and non-affine mode.
- I added back the test for type errors using .data. In particular, `.data` allows changing the type of a Tensor in place (very bad, never do it!), but since it is possible, we should test it until .data is removed.

cc Enealor who did the first version of the PR.

Test Plan: Imported from OSS

Differential Revision: D20069241

Pulled By: albanD

fbshipit-source-id: a0348f40c44df38d654fb2a2b2b526d9d42f598a
2020-02-25 07:36:42 -08:00
4ef854b4b4 Fix potential hang when exiting main process (#33721)
Summary:
The following script reproduces the hang
```py
import multiprocessing, logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

import torch

class Dataset:
    def __len__(self):
        return 23425

    def __getitem__(self, idx):
        return torch.randn(3, 128, 128), idx % 100

ds = Dataset()
trdl = torch.utils.data.DataLoader(ds, batch_size=64, num_workers=300, pin_memory=True, shuffle=True)

for e in range(1000):
    for ii, (x, y) in enumerate(trdl):
        print(f'tr {e: 5d} {ii: 5d} avg y={y.mean(dtype=torch.double).item()}')
        if ii % 2 == 0:
            print("="*200 + "BEFORE ERROR" + "="*200)
            1/0
```

The process will hang when joining the putting thread of `data_queue` in the **main process**. The root cause is that too many things are put into the queue from the **worker processes**, and the `put` at 062ac6b472/torch/utils/data/dataloader.py (L928) blocks in a background thread. The `pin_memory_thread` exits via the set `pin_memory_thread_done_event`, without ever getting the `(None, None)` sentinel. Hence, the main process needs the same treatment as the workers get at
062ac6b472/torch/utils/data/_utils/worker.py (L198).

After the patch, the script finishes correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33721

Differential Revision: D20089209

Pulled By: ezyang

fbshipit-source-id: e73fbfdd7631afe1ce5e1edd05dbdeb7b85ba961
2020-02-25 07:04:41 -08:00
7a8b6c2c6b [pytorch] blas gemm fix for k=0 (#33419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33419

These conditions are for the specific implementation; the fallback implementation works without these checks, so use the fallback if any of these checks isn't true.
ghstack-source-id: 98836075

Test Plan: Previously, the special case where k=0 produced an error, which is now gone. The error was in some complicated autograd code, and I'm not sure how or where a simple regression test should be added.

Differential Revision: D19941103

fbshipit-source-id: e1c85d1e75744b1c51ad9b71c7b3211af3c5bcc6
2020-02-25 06:49:50 -08:00
4460c8b034 [C2] Tiny changes to adagrad to make it slightly better. (#33727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33727

Some small changes to adagrad (a tiny bit faster, though there is a more interesting diff later in the stack).

Test Plan: Part of the stack

Reviewed By: chocjy

Differential Revision: D20029499

fbshipit-source-id: 7f4fddb9288d7881ef54673b17a0e19ef10d64c0
2020-02-24 23:02:17 -08:00
65864d3634 [C2] Small improvement for elementwise_mul operator. (#33537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33537

For embeddings smaller than 128, we can get a bit more compute by
allocating fewer threads per block.

Test Plan: Unit-test, benchmark.

Reviewed By: xianjiec

Differential Revision: D19969594

fbshipit-source-id: 6cc6b14fc61302804bed9093ea3591f21e3827d8
2020-02-24 23:00:27 -08:00
adbe289870 Update MKL to 2020.0.166 for Windows (#33690)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33690

Differential Revision: D20089300

Pulled By: ezyang

fbshipit-source-id: 887c006fbdb2c837f0a1c607a196811f44f1fb35
2020-02-24 22:43:34 -08:00
36919278cc C++ tensor multi-dim indexing: add index() and index_put_() overloads, simple indexing tests, merge with Python indexing path (#32841)
Summary:
This PR adds the following items:
- **1st item**: `ArrayRef<TensorIndex>` and `std::initializer_list<TensorIndex>` overloads for `Tensor::index` and `Tensor::index_put_`, to be used specifically for multi-dim indexing purpose.

Design rationale:
* C++ `Tensor::index` and `Tensor::index_put_` are both existing tensor APIs, and they currently (before this PR) only accept a list of tensors (i.e. `ArrayRef<Tensor>`) as indices. If we change their signatures to also accept non-tensors as indices (i.e. `ArrayRef<TensorIndex>`, and `TensorIndex` is convertible from `Tensor` / `Slice` / `None` / `Ellipsis`), it would slow down the original code path (since now it has to go through more steps), which is undesirable.

    To get around this problem, the proposed solution is to keep the original `ArrayRef<Tensor>` overload, and add `ArrayRef<TensorIndex>` and `std::initializer_list<TensorIndex>` overloads to `Tensor::index` and `Tensor::index_put_`. This way, the original code path won’t be affected, and the tensor multi-dim indexing API is only used when the user explicitly pass an `ArrayRef<TensorIndex>` or a braced-init-list of `TensorIndex`-convertible types to `Tensor::index` and `Tensor::index_put_` .

    Note that the above proposed solution would still affect perf for the user’s original `Tensor::index` or `Tensor::index_put_` call sites that use a braced-init-list of tensors as input, e.g. `tensor.index({...})` or `tensor.index_put_({...}, value)`, since now such function calls would take the multi-dim indexing path instead of the original advanced indexing path. However, there are only two instances of this in our codebase (one in ATen cpp test, one in a C++ API nn init function), and they can be easily changed to explicitly use `ArrayRef<Tensor>` as input (I changed them in this PR). For external user’s code, since this is part of the C++ frontend which is still considered experimental, we will only talk about this change in the release note, and ask users to switch to using `ArrayRef<Tensor>` explicitly if they want to keep using the original advanced indexing code path.

- **2nd item**: Mechanisms for parsing `ArrayRef<TensorIndex>` indices and performing indexing operations (mirroring the functions in `torch/csrc/autograd/python_variable_indexing.cpp`).
- **3rd item**: Simple tests to demonstrate that the `Tensor::index()` and `Tensor::index_put_()` APIs work. I will add more tests after the first few PRs are reviewed.
- **4th item**: Merge Python/C++ indexing code paths, for code simplicity. I tested locally and found that there is no perf regression resulting from the merge. I will get more concrete numbers for common use cases when we settle on the overall design.

This PR supersedes https://github.com/pytorch/pytorch/pull/30425.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32841

Differential Revision: D19919692

Pulled By: yf225

fbshipit-source-id: 7467e64f97fc0e407624809dd183c95ea16b1482
2020-02-24 22:04:00 -08:00
6aecfd1e80 Mobile Backend: NHWC memory layout + XNNPACK integration. (#33722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33722

In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build, but the actual integration is gated with USE_XNNPACK preprocessor guards.  This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow up PRs.  This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMAKE variable, defaulted to ON, is enabled, but will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed.  The less efficient implementation would be to hook these operators into their corresponding native implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today for instance.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, the above integration would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute and factor out one-time operations out of the innermost forward() loop.

The more optimal solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that would directly call into these implementations either through c10 or some other mechanism, also allowing for decoupling of op creation from op execution.

This PR does not include any of the front end changes  mentioned above.  Neither does it include the mobile threadpool unification present in the original https://github.com/pytorch/pytorch/issues/30644.  Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32509

Test Plan:
Build: CI
Functionality: Not exposed

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
2020-02-24 21:58:56 -08:00
2a4aad7466 Don't activate vc env again for cuda with ninja on Windows (#33700)
Summary:
Possibly get rid of https://github.com/pytorch/pytorch/issues/28271, https://github.com/pytorch/pytorch/issues/27463 and https://github.com/pytorch/pytorch/issues/25393.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33700

Differential Revision: D20089251

Pulled By: ezyang

fbshipit-source-id: 0cfe62b869fb874e25f06894aa76fadc44cf6817
2020-02-24 21:56:29 -08:00
7caf3c396b [quant][graphmode][refactor] Change signature of getModuleAccessPath (#32812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32812

We'll error out inside the function for the cases we can't handle,
instead of checking each time at the callsite.

Test Plan:
.

Imported from OSS

Differential Revision: D20087846

fbshipit-source-id: ae6d33a94adf29c4df86d67783e7ef8753c91f90
2020-02-24 21:52:43 -08:00
a1862468d0 Add missing test launchers for JitRpcTest and JitDistAutogradTest (#32891)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32891

- Add JitDistAutoGradTest into fork/spawn test launcher
- Add JitRpcTest into fork/spawn test launcher

ghstack-source-id: 98900090

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_spawn
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork_thrift

buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_spawn
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_spawn_thrift
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_fork
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_fork_thrift

buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_spawn
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_spawn_thrift
```

Differential Revision: D5785394

fbshipit-source-id: 335a85424d22f1a83874be81a8139499c9a68ce2
2020-02-24 21:42:47 -08:00
a9cef05f5d improve EmbeddingBag performance on cuda (#33589)
Summary:
This PR improves performance of EmbeddingBag on cuda by removing 5 kernel launches (2 of those are synchronizing memcopies).
- 2 memcopies were checking that the values of offsets[0] and offsets[-1] are in the expected range (0 for the former, less than the number of indices for the latter). It seems strange to check only those 2 values: if users provide invalid offsets, invalid values can be anywhere in the array, not only in the first and last elements. After this PR, the checks are skipped on cuda; the first value is forced to 0, and if the last value is larger than expected, the cuda kernel will assert. This is less nice than a ValueError, but then again, the kernel could already have asserted if other offset values were invalid. On the cpu, the checks are moved from functional.py into the cpu implementation, and will throw RuntimeError instead of ValueError. (See the sketch after this list.)
- 3 or 4 initializations (depending on the mode) of the output tensors with .zeros() are unnecessary, because every element of those tensors is written to, so their data can start out uninitialized.
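
For context, a minimal sketch of the offsets semantics being validated (illustrative only):

```python
import torch

# offsets mark where each bag starts in the flat index list; the removed
# memcopies checked only offsets[0] == 0 and offsets[-1] <= len(input).
emb = torch.nn.EmbeddingBag(10, 3, mode="sum")
input = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])  # two bags: input[0:4] and input[4:8]
out = emb(input, offsets)       # shape (2, 3)
```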
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33589

Reviewed By: jianyuh

Differential Revision: D20078011

Pulled By: ngimel

fbshipit-source-id: 2fb2e2080313af64adc5cf1b9fc6ffbdc6efaf16
2020-02-24 21:37:34 -08:00
3cf97bc23c Fix typing error of torch/nn/modules/container.pyi.in (#33686)
Summary:
* `Sequential` has an `__iter__` method, but the type stub doesn't
* `ModuleList.__getitem__` returns `Module`, but the type stub doesn't say so
* The type stub says `ParameterList` has an `insert` method, but the actual `ParameterList` doesn't
* `ParameterDict.__getitem__` should return `Parameter`
* `ParameterList` and `ParameterDict` have `extra_repr` methods

 ---

torch/nn/modules/container.py: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/container.py
torch/nn/modules/container.pyi.in: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/container.pyi.in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33686

Differential Revision: D20086730

Pulled By: ngimel

fbshipit-source-id: a8271489417461c67ff84a239c4cd96c3aa17b5c
2020-02-24 21:20:38 -08:00
d6ea4be153 Fix minor problems in index_put_ docs (#33689)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/33641
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33689

Differential Revision: D20086967

Pulled By: ngimel

fbshipit-source-id: d9dde8edb904de1cf56b9337920cb29e008b72fb
2020-02-24 21:15:36 -08:00
54aac4af1f Update hypothesis_utils.py (#33739)
Summary:
Fixes a typo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33739

Differential Revision: D20088096

Pulled By: jerryzh168

fbshipit-source-id: d8b5d263c25f8c779698607be87bf76aca1811ab
2020-02-24 20:56:42 -08:00
cba8af9b24 [pytorch] Set alias analysis kind to FROM_SCHEMA for qadd, qmul, qclamp, qconcat (#33359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33359

Updated the alias analysis kind to FROM_SCHEMA so input tensors can be marked as nonmutable
when appropriate, allowing for constant folding of these tensors.

Needed to update the schemas of the _out variants with annotations to mark the `out` input
tensor as aliased and mutable.

Test Plan:
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()

    def forward(self, x):
        w = torch.tensor([3], dtype=torch.float)
        w = torch.quantize_per_tensor(w, 1.0, 0, torch.qint8)
        y = torch.tensor([3], dtype=torch.float)
        y = torch.quantize_per_tensor(w, 1.0, 0, torch.qint8)
        return torch.ops.quantized.add_out(x, w, y)

m = torch.jit.script(M())
torch._C._jit_pass_constant_propagation(m.graph)
print(m.graph)
```
```
graph(%self : __torch__.___torch_mangle_9.M,
      %x.1 : Tensor):
  %11 : int = prim::Constant[value=12]() # <ipython-input-11-1dd94c30cb58>:9:49
  %9 : float = prim::Constant[value=1.]() # <ipython-input-11-1dd94c30cb58>:9:41
  %10 : int = prim::Constant[value=0]() # <ipython-input-11-1dd94c30cb58>:9:46
  %36 : QInt8(1) = prim::Constant[value={3}]()
  %y.2 : Tensor = aten::quantize_per_tensor(%36, %9, %10, %11) # <ipython-input-11-1dd94c30cb58>:11:12
  %24 : Tensor = quantized::add_out(%x.1, %36, %y.2) # <ipython-input-11-1dd94c30cb58>:12:15
  return (%24)
```
As expected, the aten::quantize_per_tensor() for w is now folded. The aten::quantize_per_tensor()
for y is not folded, since that tensor is aliased/modified.

Differential Revision: D19910667

fbshipit-source-id: 127071909573151dc664500d363399e3643441b7
2020-02-24 20:08:06 -08:00
bc5e9e0d55 [quant][graphmode][refactor] Move the check for qconfig inside insertObserver call (#32809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32809

This is a refactor to help further changes to quantization.cpp.
We want some operations on the graph to happen before we call insertObserver for invoked methods,
especially `addIntermediateValuesToSkipObserver`, since we want to skip the input of the ReLU
module in the `Conv - ReLU` pattern.

Test Plan:
test_jit.py
test_quantization.py

Imported from OSS

Differential Revision: D20087844

fbshipit-source-id: 28b7fa0c7ce9e254ab9208eb344893fb705e14d9
2020-02-24 20:03:33 -08:00
bf00b4d305 [TensorExpr] Add a boilerplate pass for future TensorExpr fusion pass. (#33464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33464

I added a python-exposed knob to register this pass in the custom passes pipeline. If the knob is not used, the pass is not registered and thus not run at all.

Differential Revision: D19958217

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: fecdd98567fcda069fbdf8995c796899a3dbfa5c
2020-02-24 18:47:31 -08:00
9278196d89 scatter_add uses src, not other (#32307)
Summary:
Using the `other` kwarg gives `TypeError: scatter_add_() missing 1 required positional arguments: "src"`.
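
For illustration, a minimal sketch of the correct keyword (not code from the PR):

```python
import torch

x = torch.zeros(5)
index = torch.tensor([0, 2, 4])
src = torch.ones(3)
x.scatter_add_(0, index, src)  # the third argument is named `src`
# x.scatter_add_(0, index, other=src)  # raises the TypeError above
print(x)  # tensor([1., 0., 1., 0., 1.])
```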
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32307

Differential Revision: D20076859

Pulled By: zou3519

fbshipit-source-id: dfb417c087d5be41fad02dc0b2cf0506c89b1b02
2020-02-24 18:01:34 -08:00
98af01ee7c [quant] Make FakeQuant use REGISTER_DISPATCH (#33682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33682

Previously, there were two API's for CPU and CUDA. This change keeps one top level API, i.e `fake_quantize_per_tensor_affine` and `fake_quantize_per_channel_affine` and uses the device type to dispatch to different backends (CPU and CUDA).
CPU kernel implementation is in QuantizedOpKernels.cpp
CUDA kernel implementation is in fake_quantize_core.cu
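
A sketch of how the unified entry point looks from Python; the same call now dispatches on the input's device (scale/zero-point values here are arbitrary):
```python
import torch

x = torch.randn(2, 256, 128, 128)
# One top-level API; dispatch selects the CPU kernel (QuantizedOpKernels.cpp)
# or the CUDA kernel (fake_quantize_core.cu) based on x.device.
y_cpu = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, -128, 127)
if torch.cuda.is_available():
    y_gpu = torch.fake_quantize_per_tensor_affine(x.cuda(), 0.1, 0, -128, 127)
```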

Test Plan:
python test/test_fake_quant.py

Benchmark Results for CPU
FakeQuantize tensor of size (2, 256, 128, 128)

Before:
per tensor quant ms 9.905877113342285
per channel quant ms 74.93825674057007

After:
per tensor quant ms 6.028120517730713
per channel quant ms 44.91588592529297

Imported from OSS

Differential Revision: D20072656

fbshipit-source-id: 0424f763775f88b93380a452e3d6dd0c90cb814b
2020-02-24 17:48:13 -08:00
b10a39bb32 Migrate _cat from TH to ATen (CUDA) (#33237)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24520

Benchmarks:

Upstream:

```
$ python -m pt.cat_test --tag_filter all --device cuda  --omp_num_threads 1 --mkl_num_threads 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 17.355

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 30.718

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 17.329

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 30.176

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 74.417

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 75.728

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 190.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa8876fcf28>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa8876fcf28>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 57.711

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7fa886237048>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7fa886237048>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 49.903

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7fa7b57bb840>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7fa7b57bb840>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 84.181

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bba60>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bba60>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 82.339

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7fa7b57bbae8>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7fa7b57bbae8>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 82.312

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7fa7b57bbb70>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7fa7b57bbb70>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 90.715

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 129.021

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 142.966

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 387.023

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbbf8>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbbf8>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 36.647

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbc80>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbc80>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 278.890

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbd08>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbd08>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 557.752

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbd90>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbd90>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 842.512

```

New version:

```
$ python -m pt.cat_test --tag_filter all --device cuda  --omp_num_threads 1 --mkl_num_threads 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 24.419

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 25.025

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 24.247

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 25.098

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 74.441

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 74.866

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 189.280

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1c9b056048>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1c9b056048>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 57.629

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1c9b0560d0>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1c9b0560d0>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 49.975

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f1bce8f38c8>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f1bce8f38c8>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 83.643

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3ae8>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3ae8>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 82.307

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f1bce8f3b70>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f1bce8f3b70>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 82.323

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f1bce8f3bf8>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f1bce8f3bf8>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 90.549

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 129.022

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 142.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 386.973

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3c80>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3c80>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 43.800

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3d08>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3d08>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 279.023

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3d90>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3d90>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 565.790

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3e18>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3e18>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 845.153
```
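
For reference, one of the configurations above can be timed in isolation with CUDA events; this is only a rough sketch, not the pt.cat_test harness that produced the numbers:
```python
import torch

xs = [torch.randn(512, 512, 2, device='cuda') for _ in range(2)]
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cat(xs, dim=1)                     # warm-up
torch.cuda.synchronize()
iters = 1000
start.record()
for _ in range(iters):
    torch.cat(xs, dim=1)
end.record()
torch.cuda.synchronize()
print('avg time (us):', start.elapsed_time(end) * 1000 / iters)
```
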
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33237

Differential Revision: D20069181

Pulled By: ngimel

fbshipit-source-id: b392e1ffd72c0d8df0c5a2d3ac96f59b37c84e32
2020-02-24 17:41:16 -08:00
97da60d511 Updating submodules
Summary:
GitHub commits:

ea8bae1f0f
134472ee45
37e6cf9d62
eb367d45c0
76de6e15c0
e1b1a55309

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 9d0d688d81be822900475223a787c5649e143e85
2020-02-24 17:34:59 -08:00
479e474a37 [quant][graphmode] FoldConvBatchNorm2d support shared ClassTypes (#32379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32379

Folding Conv2d - BatchNorm2d modules means recalculating the weight and bias of the Conv2d module by incorporating the parameters
of BatchNorm2d, and changing the method calls to call only the forward of the Conv2d module. This involves changing both the module
types and the graph, because the bias of Conv2d is a parameter when it has a value but an attribute when it is
None (since JIT code assumes in multiple places that parameters are Tensors). Therefore
we need to remove the bias attribute when it is None and add a bias attribute later. Since a ClassType might be shared, we do the
remove and the add in separate steps, and we keep track of processed graphs to avoid modifying a graph or type multiple times.
We also have to record the slot index of the bias so we can replay the slot removal on other instances of the Conv2d module.
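
A minimal sketch of the folding arithmetic itself (illustrative names, not the actual JIT pass):
```python
import torch

def fold_conv_bn(conv_w, conv_b, bn_rm, bn_rv, bn_w, bn_b, eps=1e-5):
    # w' = w * gamma / sqrt(var + eps), broadcast over output channels
    # b' = (b - mean) * gamma / sqrt(var + eps) + beta
    if conv_b is None:
        conv_b = torch.zeros_like(bn_rm)
    scale = bn_w / torch.sqrt(bn_rv + eps)
    fused_w = conv_w * scale.reshape(-1, 1, 1, 1)
    fused_b = (conv_b - bn_rm) * scale + bn_b
    return fused_w, fused_b
```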

Test Plan:
tbd

Imported from OSS

Differential Revision: D20078719

fbshipit-source-id: cee5cf3764f3e0c0a4a2a167b78dbada2e3835cc
2020-02-24 17:29:13 -08:00
54e41a87eb Make ELU great again (#33244)
Summary:
Due to a compiler bug, we have to apply a workaround to ELU for CUDA. A necessary condition for this bug to happen is `invoke_with_array` in `Loops.cuh`. Now, https://github.com/pytorch/pytorch/issues/33222 will kill that function, and we need to remove the workaround once https://github.com/pytorch/pytorch/issues/33222 is landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33244

Differential Revision: D20076197

Pulled By: ngimel

fbshipit-source-id: 39f99783014c78cecad1c39cb46092278ff220b9
2020-02-24 17:18:30 -08:00
5b031d961d [pt][quant] RNN debug test (#33621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33621

ghstack-source-id: 98746093

Test Plan: buck test mode/dev caffe2/test:quantization -- 'test_quantized_rnn \(test_quantization\.PostTrainingDynamicQuantTest\)'  --print-passing-details

Differential Revision: D20036968

fbshipit-source-id: 7cbb027a6afbe28bc250fc663089c6a9406e880b
2020-02-24 16:15:17 -08:00
696527e659 [caffe2] Add embedding empty ratio checker (disabled by default) (#33145)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33145

Reviewed By: xianjiec

Differential Revision: D19716574

fbshipit-source-id: 42a636600ac3977910d35093916865790bbe5b10
2020-02-24 16:10:01 -08:00
5090d7082b add propagate flag USE_DISTRIBUTED for libtorch_python_source
Reviewed By: pritamdamania87

Differential Revision: D20070789

fbshipit-source-id: fdb8a2eefb5bfc1ae1d80e29bd15eb1d70920c87
2020-02-24 16:02:47 -08:00
330b69fef8 Kill dead scalar_check. (#33695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33695

I'm not sure how this stuck around, but it has no effect.

Test Plan: Imported from OSS

Differential Revision: D20068867

Pulled By: gchanan

fbshipit-source-id: 79191338a8bc7a195e2b7265005ca6f00aab3818
2020-02-24 14:53:24 -08:00
996c0adb53 [quant] Regsiter fake_quant and observer attributes as buffers (#33626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33626

For DDP we require the attributes to be registered as buffers. By doing this the value is broadcast from one device to the rest.
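
A sketch of the pattern (not the actual observer code):
```python
import torch

class Observer(torch.nn.Module):
    def __init__(self):
        super(Observer, self).__init__()
        # Buffers, unlike plain tensor attributes, are broadcast by DDP
        # from one replica to the others, keeping the state in sync.
        self.register_buffer('min_val', torch.tensor(float('inf')))
        self.register_buffer('max_val', torch.tensor(float('-inf')))
```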

Test Plan:
Tested on actual model on GPU

Imported from OSS

Differential Revision: D20038839

fbshipit-source-id: 82e829fc3baca0b3262c3894a283c375eb08a4a4
2020-02-24 14:16:03 -08:00
dc3d47110a [docs] add experimental warning to TorchScript classes in language reference (#33697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33697

reference

Test Plan: Imported from OSS

Differential Revision: D20070220

Pulled By: suo

fbshipit-source-id: 9828d876afed59203cc472eaf0134d52d399069e
2020-02-24 14:01:19 -08:00
533b973fd0 Fix visibility of torch::nn::RNNImpl::options (#33718)
Summary:
In PR https://github.com/pytorch/pytorch/issues/33027, `options` in RNNImpl was mistakenly changed to `protected` (it was `public` before)

```
 protected:
  FORWARD_HAS_DEFAULT_ARGS({1, AnyValue(Tensor())})

  RNNOptions options;
```

This PR changes it back to `public` again.

Fixes https://github.com/pytorch/pytorch/issues/33694.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33718

Differential Revision: D20075149

Pulled By: yf225

fbshipit-source-id: 82901369eeaacd82df849e17df64dc1aaf98f9fe
2020-02-24 13:50:39 -08:00
062ac6b472 Bring up new-style registration API as wrapper around old-style (#33205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33205

A number of important use-cases are implemented:

- def(schema): defines a schema, with no implementation (alias
  inferred from schema, by default)
- def(schema, fn_ptr): registers fn_ptr as a catch-all kernel
  for the operation
- def(schema, lambda): registers lambda as a catch-all kernel
  for the operation
- def(schema, torch::dispatch(dispatch_key, fn)), and
  def(schema, torch::dispatch(device_type, fn)): registers
  the function to only be executed when dispatch_key/device_type
  is selected for use
- def(schema, TORCH_OPTIMIZED_FN(fn)): registers the function
  as unboxed only, using the inline syntax

All of our code generated registrations in ATen are switched to
the new API.

Some aspects of the API which are not fully implemented:

- It's still not valid to omit the schema when registering a function
  pointer, due to #32549
- Although it's possible to take advantage of top-level namespaces
  ala torch::import("aten"), we don't use it because this results
  in worse code (as we have to cat everything back together).  This
  is not an essential problem, we just need the internals to be less
  stupid.

There are some aspects of the API which don't semantically make sense,
but I chose not to fix them in this PR:

- For some reason, TORCH_OPTIMIZED_FN uses the *runtime* wrapper to
  do wrapping, rather than the compile-time one which inlines the
  function in.  This means that there isn't any reason we should be
  passing in the function pointer as a template argument; a regular
  old argument ought to have worked fine.  This is seemingly
  consistent with the current API though; needs further investigation.
- There's no reason to use optional<DispatchKey>; DispatchKey would
  work just fine (use DispatchKey::Undefined for the nullopt case)

In the long term, we should swap the wrapper around: the new-style
API has the real implementation, and the old-style API is backwards
compatibility.  However, this implies a lot of internal refactoring,
so I decided to short circuit around it to get this in faster

Ancillary changes:
- I stopped moving optional<DispatchKey>, it's literally just two
  words, pass it by value please.
- Needed to add a & qualified version of RegisterOps::op, since
  I'm storing RegisterOps as a member inside the new style
  Namespace and I cannot conveniently get a rvalue reference
  to it in that situation.  (BTW, register_ = std::move(register_)
  really doesn't work, don't try it!)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19856626

Pulled By: ezyang

fbshipit-source-id: 104de24b33fdfdde9447c104853479b305cbca9a
2020-02-24 11:45:14 -08:00
ced8865d91 Add sigmoid to mobile ops
Summary: Used by segmentation model.

Test Plan: Ran segmentation model on mobile.

Reviewed By: iseeyuan

Differential Revision: D19881378

fbshipit-source-id: 87f00058050fd173fbff1e88987ce09007622b83
2020-02-24 11:37:24 -08:00
32c93099c4 Add typing info for data members of utils.data.sampler classes (#33679)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33679

Differential Revision: D20063099

Pulled By: ngimel

fbshipit-source-id: 1bbf71a65408d117019ab38d7d095cfd337f5d1e
2020-02-24 11:29:59 -08:00
4d9b649261 jit pickling rref (#32959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959

In the RPC TorchScript call path we need to pickle/unpickle RRefs, so this diff makes the JIT pickler/unpickler able to pickle/unpickle RRefs. It is similar to what is implemented for PyRRef::pickle() and PyRRef::unpickle().
The pickling/unpickling design assumes it is always coupled with RPC calls. It is not meant for checkpointing a model that holds an RRef; before checkpointing such a model, the user should call rref.to_here() to get the value inside the RRef.

The pickling process is:
1. push the torch.distributed.rpc.rref global string
2. call rref.fork() to create rrefForkData, which holds a few IDs and the type string of the value held inside the rref; the IDs include the rref id, fork id, caller worker id, callee worker id, and owner worker id
3. push the rrefForkData

The unpickling process is:
1. read the torch.distributed.rpc.rref global string and retrieve the cached global lambda function
2. the global lambda function receives the rrefForkData
3. if the callee is also the owner worker, get the owner rref based on the IDs inside the rrefForkData and return the OwnerRRef
4. if the callee is not the owner worker, create a user rref using the rrefForkData and return the UserRRef
5. meanwhile, the owner rref is notified and does reference counting correctly

During unpickling, a type_resolver is needed to parse the type string. This type_resolver has a Python dependency, so we get it from the rpc_agent and pass it to the unpickler during construction. Accordingly, this diff adds a type_resolver argument to the JIT unpickler constructor.
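
A sketch of the checkpointing caveat mentioned above (assumes rpc.init_rpc(...) has been called and a peer named "worker1" exists):
```python
import torch
import torch.distributed.rpc as rpc

rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
value = rref.to_here()   # materialize the value before torch.save(...)
```
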
ghstack-source-id: 98814793

Test Plan: unit test

Differential Revision: D19713293

fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e
2020-02-24 11:16:35 -08:00
481e7f2e78 catch and propagate warnings for JIT ScriptMethods (#33010)
Summary:
We align it with ScriptFunctions by using the HANDLE_TH_ERRORS/END_HANDLE_TH_ERRORS_PYBIND macros.

Fixes https://github.com/pytorch/pytorch/issues/24155  or https://github.com/pytorch/pytorch/issues/24828 ?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33010

Differential Revision: D20053585

Pulled By: suo

fbshipit-source-id: c8876b54069285ba9638bb2328fd8738b59c396d
2020-02-24 10:28:17 -08:00
6a76433b9d [Update independent.py]add explicit string representation (#33676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33676

Differential Revision: D20069202

Pulled By: ngimel

fbshipit-source-id: 48b609d4fb7a098e9e3383553103a9441673d63f
2020-02-24 10:15:00 -08:00
6a275b696e adding IterableDataset to utils.data.__init__ (#33543)
Summary:
This should fix issue https://github.com/pytorch/pytorch/issues/27820 again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33543
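
With this change the class is importable from the package root, e.g.:
```python
from torch.utils.data import DataLoader, IterableDataset

class Stream(IterableDataset):
    def __iter__(self):
        return iter(range(4))

print(list(DataLoader(Stream(), batch_size=2)))   # [tensor([0, 1]), tensor([2, 3])]
```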

Differential Revision: D20002446

Pulled By: vincentqb

fbshipit-source-id: 7563a56fd6238efe8ea5626b02ba5e8fcda0780e
2020-02-24 10:09:38 -08:00
e3ba533c8b Minimize the cases where we have to cpu_zero. (#33570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33570

In this PR, we are a bit more careful about avoiding zero-ing the output.  Analysis as follows:
1) `mm` doesn't need zero_ because it never calls scal, which is the underlying problem.
2) for `mv`, which does call scal (in certain cases), we can just move the zeroing to where it would actually be a problem, namely when the scalar value is 0.
In this case we just run the non-BLAS version of the code.

Test Plan: Imported from OSS

Differential Revision: D20007665

Pulled By: gchanan

fbshipit-source-id: 1f3a56954501aa9b2940d2f4b35095b2f60089a8
2020-02-24 07:47:36 -08:00
641750e33c Fix NaN handling in torch.mv. (#31666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31666

List of changes:
1) Fix a case where torch.mv was not handling NaNs correctly.  In particular, with a transposed tensor and expanded vector, NaNs in the output are kept, even if beta = 0.
This is handled in the `out=` case by zero-ing out the passed-in Tensor, but this can happen just the same with the non-out variant if the allocated tensor happens to have a NaN.
Also adds tests for this case.
NOTE: we zero out the output tensor in all cases for mv and mm, even though this is probably overkill.  I didn't find another case where this would be a problem, but the old code at least
attempted to do this for all mv and mm calls and I didn't add comprehensive testing to be sure that it's not a problem.

2) on CPU: move mv, mv_out, mm, mm_out to be direct wrappers on _th_addmv, _th_addmm, rather than having their own wrappers in Declarations.cwrap.
This is to remove the magic around cpu_zero from the codegen, which simplifies the codegen and makes testing this easier.
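
A sketch of the behavior fixed in 1) above (synthetic shapes; the assertion is the point):
```python
import torch

out = torch.full((3,), float('nan'))   # simulate garbage in the output buffer
m = torch.randn(4, 3).t()              # transposed input, the problematic case
v = torch.randn(4)
torch.mv(m, v, out=out)                # with beta == 0, NaNs must not leak
assert not torch.isnan(out).any()
```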

Test Plan: Imported from OSS

Differential Revision: D19239953

Pulled By: gchanan

fbshipit-source-id: 27d0748d215ad46d17a8684696d88f4cfd8a917e
2020-02-24 07:46:08 -08:00
039dc90854 Revert D19521853: [pytorch][PR] Mobile Backend: NHWC memory layout + XNNPACK integration.
Test Plan: revert-hammer

Differential Revision:
D19521853

Original commit changeset: 99a1fab31d0e

fbshipit-source-id: 76dfc1f481797ba2386997533cf19957637687d6
2020-02-23 22:07:19 -08:00
9d834cc889 [JIT] Fix FunctionType::python_str() (#33680)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33680

Test Plan: Imported from OSS

Differential Revision: D20062777

Pulled By: jamesr66a

fbshipit-source-id: fcdb0527ca6776ff161cd535794e9c12bb32bdde
2020-02-23 21:52:09 -08:00
5fa03d4dbb Fix bug where we were trying to get a schema for prim::Constant, which is not registered as an operator. (#33645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33645

Fix bug where we were trying to get a schema for prim::Constant, which is not registered as an operator.
ghstack-source-id: 98785729

Test Plan: buck test mode/dev //pytext/models/test:scripted_seq2seq_generator_test -- 'test_generator \(pytext\.models\.test\.scripted_seq2seq_generator_test\.ScriptedSeq2SeqGeneratorTest\)'

Differential Revision: D20050833

fbshipit-source-id: cc38510b0135b750fdf57fb9c1e66ce1d91ee128
2020-02-23 21:37:35 -08:00
e1bddbbaf6 Bounds checking for functor execution in vectorized/unrolled kernels (#33642)
Summary:
The current logic for vectorized/unrolled operations in CUDALoops.cuh applies bounds checking to loads and stores, [but not to the actual functor's execution](16d6c17845/aten/src/ATen/native/cuda/CUDALoops.cuh (L264)).  In other words, for a block acting on the tail of a tensor that doesn't require the whole block to participate in memory transactions, many threads execute their functor on uninitialized data.  For functors that only communicate with the outside world via the bounds-checked loads and stores, that's ok.  The threads acting on garbage data never actually write their results.  But [my proposed inf/nan checking kernel](https://github.com/pytorch/pytorch/pull/33366/files#diff-9701a2b34900195d160bdc234e001b79R70-R79) has the additional side effect of writing to a `found_inf` flag in global memory.  For irregularly-shaped tensors where tail threads execute the functor on garbage data, these threads would sometimes see and report spurious infs/nans.

In general, we can't guarantee functors won't have side effects.  For safety (and efficiency) we should apply bounds checking to the functor execution as well as the loads and stores.

Is it possible that other elementwise kernels (in addition to the strided/vectorized implementation) are also executing functors unconditionally?  That would cause similar failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33642

Differential Revision: D20062985

Pulled By: ngimel

fbshipit-source-id: 65b8d75a001ce57865ed1c0cf89105d33f3f4dd4
2020-02-23 21:17:31 -08:00
941b42428a Mobile Backend: NHWC memory layout + XNNPACK integration. (#32509)
Summary:
In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build, but the actual integration is gated with USE_XNNPACK preprocessor guards.  This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow up PRs.  This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMAKE variable, defaulted to ON, is enabled, but will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed.  The less efficient implementation would be to hook these operators into their corresponding **native** implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today for instance.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, the above integration would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute and factor one-time operations out of the innermost forward() loop.

The more optimal solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models. Alternatively, new eager-mode modules can be introduced that directly call into these implementations, either through c10 or some other mechanism, also allowing op creation to be decoupled from op execution.

This PR does not include any of the front end changes  mentioned above.  Neither does it include the mobile threadpool unification present in the original https://github.com/pytorch/pytorch/issues/30644.  Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32509

Reviewed By: dreiss

Differential Revision: D19521853

Pulled By: AshkanAliabadi

fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
2020-02-23 19:08:42 -08:00
7aa605ed92 Remove uses of .data in test_torch (#33638)
Summary:
Removes almost every usage of `.data` in test_torch to address part of https://github.com/pytorch/pytorch/issues/33629.

Lines 4706-4710 had to be refactored to allow this. The changed test is fundamentally the same, as it appears to be meant to confirm that using an input of a different type than the weight causes an appropriate error.

There is one remaining usage of `.data`, and it is on line 5132. This was left as the `set_` and `resize_` methods still mention `.data` explicitly. I figure the right time to remove this is when those methods have their runtime errors updated.

Note: ~~some tests are skipped locally, and so I am still verifying that nothing has been obviously broken.~~ Appears to be passing early tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33638

Differential Revision: D20062288

Pulled By: albanD

fbshipit-source-id: 672a6d7a20007baedb114a20bf1ddcf6c4c0a16a
2020-02-23 14:11:21 -08:00
6d448acb34 [PyTorch BC] Skip aten::random_ to fix BC CI (#33666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33666

It's caused by a revert, so let's skip it.

Test Plan: ci

Reviewed By: hl475

Differential Revision: D20057382

fbshipit-source-id: d71af8efe68b31befcef5dddc372540e8a8ae2ac
2020-02-22 21:28:18 -08:00
9e384f9ce4 Remove duplicate header include. (#33656)
Summary:
The same header `<torch/nn/functional/conv.h>` is included twice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33656

Differential Revision: D20056913

Pulled By: yf225

fbshipit-source-id: b1563035c9821731b99c26eec130ff0b9cc627a7
2020-02-22 14:17:07 -08:00
312627a7c3 Revert D19776613: Migrate random_ from the TH to Aten (CPU)
Test Plan: revert-hammer

Differential Revision:
D19776613

Original commit changeset: a8d262bccf5f

fbshipit-source-id: 36389ffa3d8377743f55f97221d7a7ee25a409f6
2020-02-22 08:15:27 -08:00
a2f3c6c26f Call RandomNumberSeed() on-demand (#33539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33539

We rarely use `random_seed_` in context, but we always initialize it with `RandomNumberSeed()`, which isn't trivial. This diff makes it so that we only call `RandomNumberSeed()` once, when `random_seed_` is actually used.

Test Plan:
unittests.

Canaries:
AF: https://our.intern.facebook.com/intern/ads/canary/424753437441438410
AI: https://our.intern.facebook.com/intern/ads/canary/424753467414318838
Prospector: https://our.intern.facebook.com/intern/ads/canary/424753976999968569

Reviewed By: ipiszy

Differential Revision: D19993190

fbshipit-source-id: 1d2606bd65476ff3b519c69f9cbfa3b80f75cdff
2020-02-22 01:22:18 -08:00
8291e06f8f Fixes cuda->numpy and non-strided->numpy segfaults (#33612)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/33300.

Calling .numpy() on a CUDA or non-strided (e.g. sparse) tensor segfaults in current PyTorch. This fixes the segfaults and throws the appropriate TypeError, as was intended.
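
A sketch of the now-correct behavior:
```python
import torch

t = torch.ones(3, device='cuda')
try:
    t.numpy()                 # raises TypeError instead of segfaulting
except TypeError:
    arr = t.cpu().numpy()     # copy to host first, then convert
```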

Two tests, one in test_cuda.py and the other in test_sparse.py, are added to verify the behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33612

Differential Revision: D20038210

Pulled By: mruberry

fbshipit-source-id: 265531dacd37c392232fd3ec763489a62ef54795
2020-02-21 22:23:08 -08:00
59daf1611b [Caffe2] Skip //caffe2/caffe2:caffe2_test_cpu -- 'DBSeekTest\.RocksDB'
Summary: Skip the test to unblock dper fbpkg push

Test Plan: buck test //caffe2/caffe2:caffe2_test_cpu -- 'DBSeekTest\.RocksDB' --run-disabled

Reviewed By: cheshen1

Differential Revision: D20043418

fbshipit-source-id: 05ceb2cea08722a671fa211d73680fd4b78f354c
2020-02-21 21:30:02 -08:00
1c08fa7051 [Caffe2] Skip caffe2/caffe2:caffe2_test_cpu - DBSeekTest.LMDB
Summary: skip broken tests in https://fburl.com/svc/zsbsrc7a to unblock dper fbpkg push.

Test Plan: buck test //caffe2/caffe2:caffe2_test_cpu -- 'DBSeekTest\.LMDB' --run-disabled

Reviewed By: cheshen1

Differential Revision: D20042330

fbshipit-source-id: 5b86e66da2a219c915c471b8e87f33239bdc5ba9
2020-02-21 21:28:31 -08:00
a7e22b4c6a add bailout checks to checkScript (#32802)
Summary:
This adds enough infrastructure to run bailout checks in `checkScript`. I'll need to figure out the best way to enable it for nightly builds now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32802

Differential Revision: D19974718

Pulled By: Krovatkin

fbshipit-source-id: 40485503f6d3ae14edcce98e1eec1f0559f3ad08
2020-02-21 21:18:54 -08:00
9b2b15f4fc misc windows warning fixes (#33632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33632

* `inline_container.h` was unnecessarily exposing all includers to caffe2 headers via `caffe2/core/logging.h`
* Add msvc version of hiding unused warnings.
* Make sure clang on windows does not use msvc pragmas.
* Don't redefine math macro.

Test Plan: CI green

Differential Revision: D20017046

fbshipit-source-id: 230a9743eb88aee08d0a4833680ec2f01b7ab1e9
2020-02-21 19:36:25 -08:00
d971007c29 Migrate random_ from the TH to Aten (CPU) (#32534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32534

Fixes #24752
Fixes #32510

Test Plan: Imported from OSS

Differential Revision: D19776613

Pulled By: pbelevich

fbshipit-source-id: a8d262bccf5f2807f6125c83080aa16d77491b19
2020-02-21 16:13:58 -08:00
e10aa6b72f Fix flaky DagNetTest unittest
Summary: The first run of the net is noisy sometimes - just run it twice.

Reviewed By: cheshen1

Differential Revision: D20039274

fbshipit-source-id: 639e65646bf52f3efe1ecd4bbcd0e413d9389b29
2020-02-21 16:08:04 -08:00
6474ea404d [C2] Native GPU implementation for bucketize (#33529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33529

The current version goes through a GPU -> CPU -> GPU copy and is pretty slow: ~19 ms
for 1M elements with 20 possible buckets, based on the benchmark.

The new native version is ~0.2 ms on the same benchmark.

Test Plan: benchmark + unit-test

Reviewed By: chocjy

Differential Revision: D19969518

fbshipit-source-id: 51889bc9a232b6d45d9533e53b7b7f4531da481f
2020-02-21 15:47:04 -08:00
15ba902c08 Turn ONNX_ML into a proper build option. (#33424)
Summary:
The detection of the env variable ONNX_ML has been properly handled in tools/setup_helpers/cmake.py,
line 242.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33424

Differential Revision: D20043991

Pulled By: ezyang

fbshipit-source-id: 91d1d49a5a12f719e67d9507cc203c8a40992f03
2020-02-21 15:42:33 -08:00
16d6c17845 improve roll performance (#33623)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33623

Differential Revision: D20037643

Pulled By: ngimel

fbshipit-source-id: 9fd293eca5242daf414c116344b2e1fde9f9ebc5
2020-02-21 15:09:51 -08:00
f62f1b2ef0 Revert "Revert D19964089: [pytorch][PR] Allow vectorized gpu loop to … (#33553)
Summary:
…have different argument types"

This reverts commit 05fb160048b71c1b8b00d2083a08618318158c1a.

Please go to https://github.com/pytorch/pytorch/pull/33558 and check the CUDA9 on CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33553

Differential Revision: D20017575

Pulled By: ngimel

fbshipit-source-id: a5fd78eea00c7b0925ab21fd90a7daeb66725f1a
2020-02-21 14:56:30 -08:00
a72946dbab Stop generating out full function type for registration, use decltype or infer it (#33097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33097

Previously, we had to specify full types because the functions we registering
might be overloaded, and the type was necessary to resolve the ambiguity.  I
disambiguate all of these names by mangling the names of the methods we
place on CPUType/CUDAType/TypeDefault with the overload name (these are
*internal* wrappers which are not user visible), and then can strip
the generation of full function types from the registration.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19837898

Pulled By: ezyang

fbshipit-source-id: 5f557184f6ec84cb0613d4eb2e33b83fd1712090
2020-02-21 14:26:14 -08:00
22963f42ec Delete unnecessary aliasAnalysis specification from operator registrations. (#33093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33093

In #30187 the aliasAnalysis field on operator registration was updated
so that alias analysis could be specified in only some registration call
sites, rather than requiring it be consistently specified in all call
sites.  With this change, we can eliminate the requirement that all
registrations specify aliasAnalysis; as long as we know *one* site
specifies the correct aliasAnalysis, we don't have to specify it
any of the other sites.

In this patch, the "one site" is TypeDefault.cpp (previously we only
generated these stub declarations for manually registered functions,
but now we generate the stubs for everything).  Then I delete aliasAnalysis
anywhere we register an op for an existing function (which is a lot
of places).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19837897

Pulled By: ezyang

fbshipit-source-id: 26a7fbc809ec1553da89ea5c0361f3e81526d4c2
2020-02-21 14:24:44 -08:00
d5b768dffd refactor strongTypePtr (#33590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33590

ghstack-source-id: 98713798

Test Plan: unit test

Differential Revision: D20015521

fbshipit-source-id: 8c744a6f30f12671bef89c3555110ce26609d9a3
2020-02-21 13:32:18 -08:00
47e90d774e C++/Python API Parity: add pad_sequence (#32387)
Summary:
- add `pad_sequence` and tests
- related issue https://github.com/pytorch/pytorch/issues/25883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32387

Differential Revision: D20025421

Pulled By: yf225

fbshipit-source-id: caa9ae2114bece8db387a3a1610f24a3e06b1324
2020-02-21 13:16:09 -08:00
bb5181b716 [TensorExpr] Add IR Printer. (#33220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33220

Test Plan: Imported from OSS

Differential Revision: D19848379

Pulled By: ZolotukhinM

fbshipit-source-id: 1c6ab4f63080d4506dedc3c47938de92fb4bfba2
2020-02-21 13:10:26 -08:00
fc70fc3610 [TensorExpr] Add IR visitor, IR mutator, and IR evaluator. (#33219)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33219

Test Plan: Imported from OSS

Differential Revision: D19848381

Pulled By: ZolotukhinM

fbshipit-source-id: 44ca7cd99c25e290a8ffd8146785c19f9c785dfd
2020-02-21 13:10:22 -08:00
49af9425a7 [TensorExpr] Add core classes for representing expressions and statements. (#33218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33218

Test Plan: Imported from OSS

Differential Revision: D19848378

Pulled By: ZolotukhinM

fbshipit-source-id: 48399f8651324d5ad0607e08573d5d7b2026bb23
2020-02-21 13:10:17 -08:00
1a4f997178 [TensorExpr] Add a class for representing data type. (#33217)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33217

Test Plan: Imported from OSS

Differential Revision: D19848380

Pulled By: ZolotukhinM

fbshipit-source-id: d8683f8fc4555d2456cd2a7c827d8e8231915b49
2020-02-21 13:10:12 -08:00
089d658153 [TensorExpr] Add classes for memory management in tensor expressions. (#33216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33216

All tensor expressions belong to a kernel arena and are freed when the
arena is destroyed. Until it is destroyed, all expressions stay valid.

Test Plan: Imported from OSS

Differential Revision: D19848382

Pulled By: ZolotukhinM

fbshipit-source-id: a581ea2b635b9ba2cc53949616a13d8d3a47caae
2020-02-21 13:08:50 -08:00
616beb1412 [ROCm] Added support for pytorch extensions to use HIP (#32669)
Summary:
This pull request has changes for:
1. Enabling a torch module with HIP code to be compiled by cpp_extensions.py
2. Fixes for hipify module to be able to be used by a torch extension

cc: ezyang iotamudelta jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32669

Differential Revision: D20033893

Pulled By: zou3519

fbshipit-source-id: fd6ddc8cdcd3930f41008636bb2bc9dd26cdb008
2020-02-21 12:10:02 -08:00
ca8e025cdf improve the doc of enforce_sorted in pack_padded_sequence (#33617)
Summary:
This is a follow-up PR to https://github.com/pytorch/pytorch/issues/33602:

torch/nn/utils/rnn.html:

`pack_padded_sequence` has a confusing and incomplete description of the `enforce_sorted` param. Currently it goes:

```
        enforce_sorted (bool, optional): if ``True``, the input is expected to
            contain sequences sorted by length in a decreasing order. If
            ``False``, this condition is not checked. Default: ``True``.
```

The second part, "this condition is not checked", (1) makes no sense, since the alluded-to condition is not described, and (2) is incomplete, as it doesn't reflect the important part: that it actually does the sorting. I think it should say something like:

```
        enforce_sorted (bool, optional): if ``True``, the input is expected to
            contain sequences sorted by length in a decreasing order. If
            ``False``, the input will get sorted unconditionally. Default: ``True``.
```
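
For context, a usage sketch with unsorted lengths (arbitrary shapes):
```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

seqs = torch.zeros(3, 5, 8)            # (batch, max_len, features)
lengths = torch.tensor([2, 5, 3])      # not sorted in decreasing order
packed = pack_padded_sequence(seqs, lengths, batch_first=True,
                              enforce_sorted=False)   # sorted internally
```
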
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33617

Differential Revision: D20035131

Pulled By: albanD

fbshipit-source-id: 654382eb0cb62b5abc78497faa5b4bca42db5fda
2020-02-21 11:51:08 -08:00
293fa5fc44 [Documentation] Fix minor typo in torch.serialization (#33549)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33549

Differential Revision: D20002545

Pulled By: albanD

fbshipit-source-id: 46fe2002329e5250c009eb066432909b71ecd74d
2020-02-21 09:29:13 -08:00
e77abb9a5b Normalize reward-to-go in C++ actor-critic (#33550)
Summary:
Comparing to the [Python implementation](https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py), it seems like the tensor of normalized reward-to-go is computed but never used. Even if it's just an integration test, this PR switches to the normalized version for better convergence.
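
The normalization in question is the standard one from the Python example, roughly:
```python
import torch

def normalize_returns(returns, eps=1e-8):
    # Normalize reward-to-go to zero mean and unit variance.
    return (returns - returns.mean()) / (returns.std() + eps)
```
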
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33550

Differential Revision: D20024393

Pulled By: yf225

fbshipit-source-id: ebcf0fee14ff39f65f6744278fb0cbf1fc92b919
2020-02-21 09:19:39 -08:00
ee28831341 [jit] Fix aug assign for non-tensor attributes (#32993)
Summary:
Instead of erroring out, this de-sugars augmented assignments to class
members from `self.a += 1` to `self.a = self.a + 1`.

Fixes #32973
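
A sketch of the case this enables (a scripted module with a non-tensor attribute):
```python
import torch

class Counter(torch.nn.Module):
    def __init__(self):
        super(Counter, self).__init__()
        self.count = 0

    def forward(self):
        self.count += 1   # compiled as: self.count = self.count + 1
        return self.count

m = torch.jit.script(Counter())
```
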
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32993

Pulled By: driazati

Differential Revision: D19737636

fbshipit-source-id: 07307cde88d8c348a7affdafe26db21c74e28ec0
2020-02-21 08:42:35 -08:00
fa80299bdf __torch_function__ overrides for torch.functional and torch.nn.functional (#32799)
Summary:
This adds `__torch_function__` support for all functions in `torch.functional` and `torch.nn.functional`.

The changes to C++ code and codegen scripts are to facilitate adding `__torch_function__` support for the native functions in `torch._C._nn`. Note that I moved the `handle_torch_function` C++ function to a header that both `python_torch_functions.cpp` and `python_nn_functions.cpp` include. The changes to `python_nn_functions.cpp` mirror the changes I made to `python_torch_functions.cpp` when `__torch_function__` support was first added in https://github.com/pytorch/pytorch/issues/27064. Due to the somewhat different way the `torch._C` and `torch._C._nn` namespaces are initialized I needed to create a new static reference to the `torch._C._nn` namespace (`THPNNVariableFunctions`). I'm not sure if that is the best way to do this. In principle I could import these namespaces in each kernel and avoid the global variable but that would have a runtime cost.

I added `__torch_function__` support to the Python functions in `torch.nn.functional` following the approach in https://github.com/pytorch/pytorch/issues/32194.

I re-enabled the test that checks if all functions in the `torch` namespace are explicitly tested for `__torch_function__` support. I also generalized the check to work for `torch.functional` and `torch.nn.functional` as well. This test was explicitly disabled in https://github.com/pytorch/pytorch/issues/30730 and I'm happy to disable it again if you think that's appropriate. I figured now was as good a time as any to try to re-enable it.

Finally I adjusted the existing torch API tests to suppress deprecation warnings and add keyword arguments used by some of the code in `torch.nn.functional` that were missed when I originally added the tests in https://github.com/pytorch/pytorch/issues/27064.
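
A sketch of the protocol in action on torch.nn.functional, using the NumPy-style instance-method signature of this era (later releases turned `__torch_function__` into a classmethod); names here are illustrative:
```python
import torch

class Logged(object):
    def __init__(self, t):
        self.t = t

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        args = [a.t if isinstance(a, Logged) else a for a in args]
        print('intercepted:', func.__name__)
        return func(*args, **kwargs)

torch.nn.functional.relu(Logged(torch.randn(3)))   # now dispatches here too
```
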
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32799

Differential Revision: D19956809

Pulled By: ezyang

fbshipit-source-id: 40d34e0109cc4b9f3ef62f409d2d35a1d84e3d22
2020-02-21 08:38:37 -08:00
6cec555926 Replace AT_CHECK with TORCH_CHECK in torch/csrc/jit/pybind_utils.h (#33524)
Summary:
This was generating a considerable number of warnings, because the header file
is included in multiple places.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33524

Differential Revision: D20006604

Pulled By: ezyang

fbshipit-source-id: 0885cd2a708679ba5eeabb172366eb4c5a3bbef4
2020-02-21 08:38:32 -08:00
90f4c5695e Revert "Revert D19975411: Remove special case codegen for tril_indices/triu_indices." (#33572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33572

This reverts commit 687a7e4a2566861c53c8fb53a80b198465168b38.

Original PR #33305

Reland with BC tests whitelisted. See https://github.com/pytorch/pytorch/issues/33580 for reasoning why this change is not actually BC breaking.

Test Plan: Imported from OSS

Differential Revision: D20011011

Pulled By: ezyang

fbshipit-source-id: 116374efc93af12b8ad738a0989d6f0daa9569e2
2020-02-21 08:36:32 -08:00
e2a9ea0f72 Ensure that lambda is no less than zero in softshrink (#33201)
Summary:
Softshrink is ill-defined when `lambda < 0`.
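
A sketch of the check:
```python
import torch
import torch.nn.functional as F

x = torch.randn(4)
F.softshrink(x, lambd=0.5)    # fine
F.softshrink(x, lambd=-1.0)   # now rejected instead of silently misbehaving
```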
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33201

Differential Revision: D19899571

Pulled By: ezyang

fbshipit-source-id: ac0dd8edea3435810a76a3a88152f83a024c7859
2020-02-21 08:34:06 -08:00
a6a72ac68f Fix all occurrences of C416. (#33429)
Summary:
C416: Unnecessary (list/set) comprehension - rewrite using list/set().

See https://pypi.org/project/flake8-comprehensions/
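
The rewrite the lint asks for, in miniature:
```python
items = range(5)
copy1 = [x for x in items]   # flagged by C416
copy2 = list(items)          # preferred rewrite
assert copy1 == copy2
```
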
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33429

Differential Revision: D19972858

Pulled By: ezyang

fbshipit-source-id: faac042a94c59d737bd5ae983121a0a029346e23
2020-02-21 08:32:22 -08:00
4588f49f68 Kill cudaDeviceAllocator in THCState (#33380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33380

Differential Revision: D19973151

Pulled By: ezyang

fbshipit-source-id: 41634c43b28ca723e39e761afd32e5015e122368
2020-02-21 08:06:11 -08:00
a943b0518b strict check for a device type in Fuser (#33025)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33025

Differential Revision: D19975873

Pulled By: Krovatkin

fbshipit-source-id: 57f160bec9e4285dda63611f12665264754aac32
2020-02-20 23:53:27 -08:00
e8a03438cc Make TestCuda.test_memory_stats more robust (#33575)
Summary:
IIUC, Python does not guarantee when an object is garbage collected. So it is possible that some other test running before `TestCuda.test_memory_stats` creates an object which is only garbage collected during `TestCuda.test_memory_stats`, causing the memory stats to change and the test to fail. This kind of failure is very hard to debug (it took me, mcarilli, and ptrblck quite a while to figure out what was happening), and it is the root cause of mcarilli's gradient scaling PR https://github.com/pytorch/pytorch/pull/26512 failing on Windows.
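
A sketch of the mitigation pattern (collect pending garbage before taking a baseline reading):
```python
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
baseline = torch.cuda.memory_allocated()   # now stable against earlier tests
```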

cc: csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33575

Differential Revision: D20009260

Pulled By: ngimel

fbshipit-source-id: 62f2716aefac3aa6c7d1898aa8a78e6b8aa3075a
2020-02-20 21:02:55 -08:00
009293ec5c [pytorch][size] remove unused SparseCPUType from mobile build (#33517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33517

I don't think any mobile model uses the SparseCPU backend yet, so we can skip
generating dispatch code for this backend type.

This will help reduce mobile code size with dynamic dispatch turned on,
roughly ~100K for uncompressed iOS: D19616007 +413K vs. D19616016 +319K.

It probably doesn't affect much static dispatch build size as the unused
static dispatch methods will be stripped by linker in the end.
ghstack-source-id: 98615810

Test Plan: - CI & BuildSizeBot

Reviewed By: linbinyu

Differential Revision: D19978633

fbshipit-source-id: 27bf6ada2ba98482084cf23724cf400b538b0a03
2020-02-20 20:12:36 -08:00
ac9b40164d Use cheaper check in isTensorList (#33528)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33528

Test Plan: Imported from OSS

Reviewed By: ajyu

Differential Revision: D19989166

Pulled By: zdevito

fbshipit-source-id: b0c484e037ca48226ed4d9204a06982e0c627ff0
2020-02-20 20:10:51 -08:00
d3d975cbf6 Updating submodules
Summary:
GitHub commits:

a16cb11a77
d92f4e3e1e
d021412065
a7c056b5b4
ac6d53d1c9
d75ce0a8ae
622abbcbb3
e1f7368d51
dc2e654b75
50c9e44631

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 452151a75a70f744cba309b2700f274275d476bd
2020-02-20 18:25:57 -08:00
9266bde970 [pytorch] Minor: add GIL assert to PythonRpcHandler::handleExceptionGILHeld (#33557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33557

We should add GIL asserts in some places to keep assumptions documented.
This just adds one in an exception codepath as a placeholder for more.

This change also moves a #define from a .h to the .cpp to reduce scope.
ghstack-source-id: 98673532

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20005387

fbshipit-source-id: b7eff54a6f1dd69d199f8ca05cdb3001c50b37c4
2020-02-20 18:15:44 -08:00
0bde610c14 Re-sync with internal repository (#33591) 2020-02-20 16:46:16 -08:00
3498c000e2 [TVM] Remove dynamic batch size dispatching (#33584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33584

- Remove dynamic batch size dispatching
- Set caffe2_tvm_min_ops to 8
- Set caffe2_tvm_profiling_based_jit to false
- Rename some variable names

Test Plan: buck test caffe2/caffe2/fb/tvm:test_tvm_transform

Reviewed By: yinghai

Differential Revision: D19850620

fbshipit-source-id: 2ec9bbd9fa72f953e79f3e27609ad00d4e135710
2020-02-20 16:13:29 -08:00
faa800eb5b [JIT] remove inline everything jitter skip (#33468)
Summary:
The `not inline_everything` check was causing the jitter check to be skipped whenever we emitted a function. Thanks SplitInfinity for pointing this out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33468

Differential Revision: D19975934

Pulled By: eellison

fbshipit-source-id: 03faf8d2fd93f148100d8cf49cb67b8e15cf1f04
2020-02-20 15:58:25 -08:00
c882425c24 Add 64-bit indexing support to THC index reductions (#33405)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32863, (together with https://github.com/pytorch/pytorch/issues/33310 for the `TensorIterator` reductions)

This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.

I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call max on just one `2**32`-element tensor. That seems excessive, even for a slow test, but I can push it if preferred.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33405

Differential Revision: D20010769

Pulled By: ezyang

fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3
2020-02-20 15:20:14 -08:00
23846d5a38 [caffe2] use Clang identification macro in various places (#33574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33574

Sprinkle with Clang identification macro places that otherwise would cause build errors when Clang is used to drive the CUDA compilation.

Note: `__clang__` is defined when either Clang is used as host compiler by NVCC or when Clang drives the compilation. `__CUDA__` is defined only for the latter case.

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```

Reviewed By: BIT-silence

Differential Revision: D20007440

fbshipit-source-id: 53caa70695b99461a3910d41dc71a9f6d0728a75
2020-02-20 15:16:11 -08:00
5782758b54 Add instructions and operators for new bytecode format of PyText model (#33555)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33555

A quick fix for the PyText model (in internal production) on the new bytecode format.

Test Plan: Imported from OSS

Differential Revision: D20008266

Pulled By: iseeyuan

fbshipit-source-id: 1916bd0bf41093898713c567c7f6fa546b9ea440
2020-02-20 15:05:37 -08:00
108fc78395 [caffe2] fix invalid % escape in inline assembly strings (#33554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33554

NVCC/GCC accepts the existing syntax, but not Clang which requires a proper escape. Here `%laneid` is one of the many registers that CUDA's pseudo-asm provides [1]. And using the extra `%` doesn't change the semantics, as PTX expects `%laneid` value after it's processed by the asm tool.

1. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```

Reviewed By: bddppq

Differential Revision: D20003621

fbshipit-source-id: 8e550e55a3455925e7bd92c6df3e504b5d38c2dc
2020-02-20 14:31:52 -08:00
e5cf7afd0a torch.tensor can infer complex dtype now (#33361)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33361
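
A sketch of the new inference; the exact complex dtype chosen follows the default float dtype, so the comment below is an assumption:
```python
import torch

t = torch.tensor(1 + 2j)   # previously required an explicit complex dtype
print(t.dtype)             # a complex dtype, e.g. torch.complex64 under float32
```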

Test Plan: Imported from OSS

Differential Revision: D19943477

Pulled By: anjali411

fbshipit-source-id: ff6d7d2a6fdb6c58390f33bdd8be2f3fa182518b
2020-02-20 14:24:15 -08:00
13e4ee7883 Added tensor.is_complex(), is_complex and dtype.is_complex py binding, tensor printing, and dixed the scalar type returned for complex float (#33268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33268

Test Plan: Imported from OSS

Differential Revision: D19907698

Pulled By: anjali411

fbshipit-source-id: c3ce2e99fc09da91a90a8fb94e5525a00bb23703
2020-02-20 13:38:01 -08:00
36d724c963 run peephole to do profile-based optimizations (#33337)
Summary:
We need to run a peephole pass before constant propagation in the profiling pipeline, so that we fold `prim::shape` for inputs with complete tensor types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33337

Differential Revision: D19905624

Pulled By: Krovatkin

fbshipit-source-id: 80fff067941556053847ddc7afe0fd1c7a89a3ba
2020-02-20 12:39:22 -08:00
1a25747342 Check for consistent devices in at::where (#33432)
Summary:
Changelog:
- Add a check to ensure that all inputs to `where` lie on the same device
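
A sketch of the rejected case (requires a CUDA build):
```python
import torch

cond = torch.tensor([True, False])
a = torch.zeros(2)                  # CPU
b = torch.ones(2, device='cuda')    # CUDA
torch.where(cond, a, b)             # now raises a device-mismatch error
```
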
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33432

Test Plan:
- Added test_where_invalid_device

Fixes https://github.com/pytorch/pytorch/issues/33422

Differential Revision: D19981115

Pulled By: VitalyFedyunin

fbshipit-source-id: 745896927edb53f61f3dd48ba9e1e6cd10d35434
2020-02-20 12:18:01 -08:00
71225ecc8c Revert D20006312: Revert D19975410: Update documentation on why _cudnn_init_dropout_state looks the way it is.
Test Plan: revert-hammer

Differential Revision:
D20006312

Original commit changeset: 4d4cc8ae78ad

fbshipit-source-id: 4bd4b9d1331dc97f5b83e0df491be5fd0a11214a
2020-02-20 12:05:13 -08:00
687a7e4a25 Revert D19975411: Remove special case codegen for tril_indices/triu_indices.
Test Plan: revert-hammer

Differential Revision:
D19975411

Original commit changeset: 996598759bed

fbshipit-source-id: 6bdb4b8f903e13815fc146e6f3260e5bb04c1045
2020-02-20 11:29:53 -08:00
d19a50bf27 Add missing weight_decay parameter validation for Adam and AdamW (#33126)
Summary:
Adam and AdamW are missing parameter validation for weight_decay; the other optimizers already have this check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33126
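
A minimal sketch of the added validation (the exact message wording is an assumption):

```python
import torch

params = [torch.randn(2, requires_grad=True)]

# A negative weight_decay is now rejected at construction time,
# matching the checks the other optimizers already perform:
try:
    torch.optim.Adam(params, lr=1e-3, weight_decay=-0.1)
except ValueError as err:
    print(err)  # e.g. "Invalid weight_decay value: -0.1"
```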

Differential Revision: D19860366

Pulled By: vincentqb

fbshipit-source-id: 286d7dc90e2f4ccf6540638286d2fe17939648fc
2020-02-20 11:11:51 -08:00
cdf381c967 Fix LambdaLR scheduler side effects (#32848)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32848

Differential Revision: D19859736

Pulled By: vincentqb

fbshipit-source-id: 43b3cbb2b6bed208c75aad37aebc2a8a9565fe0d
2020-02-20 11:09:56 -08:00
3233033a17 Revert D19975410: Update documentation on why _cudnn_init_dropout_state looks the way it is.
Test Plan: revert-hammer

Differential Revision:
D19975410

Original commit changeset: eb729870c2d2

fbshipit-source-id: 4d4cc8ae78ad18751c126b93d82932ac2732f1b5
2020-02-20 11:01:44 -08:00
718c538ff9 Add ability to enable/disable MIOpen at runtime (#33118)
Summary:
1. Set `torch._C.has_cudnn` to `True` for ROCm
2. Make MIOpen invocations respect value of `cudnn_enabled` or `at::globalContext().userEnabledCuDNN()`
3. `torch/backends/cudnn/__init__.py`: Add hip-specific changes (use "hide whitespace changes" option to view simpler diff)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33118
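
A rough sketch of what this enables on ROCm builds, using the existing cuDNN switch that now governs MIOpen as well:

```python
import torch

# Toggle MIOpen-backed kernels at runtime via the cuDNN knobs:
torch.backends.cudnn.enabled = False  # MIOpen kernels are skipped
torch.backends.cudnn.enabled = True   # and used again
print(torch.backends.cudnn.is_available())
```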

Differential Revision: D19977719

Pulled By: bddppq

fbshipit-source-id: 64d4dd1d78afcf96201360d85b8be5950f96dfad
2020-02-20 10:47:57 -08:00
01e1de8220 allow remote torchscript call to itself (#32990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32990

Right now a remote TorchScript call cannot target the caller itself; this diff adds support for that, in the same way it is already supported for remote Python calls to self.
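
A rough sketch of the now-supported pattern, assuming a single-process RPC group (the address, port, and worker name are illustrative):

```python
import os
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def double(x: torch.Tensor) -> torch.Tensor:
    return x * 2

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker0", rank=0, world_size=1)

# The destination is the caller itself, which previously failed
# for TorchScript functions:
result = rpc.rpc_sync("worker0", double, args=(torch.ones(2),))
print(result)  # tensor([2., 2.])
rpc.shutdown()
```
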
ghstack-source-id: 98599082

Test Plan: unit test

Differential Revision: D19731910

fbshipit-source-id: 6495db68c3eaa58812aa0c5c1e72e8b6057dc5c4
2020-02-20 09:44:10 -08:00
a9e4448dff Update documentation on why _cudnn_init_dropout_state looks the way it is. (#33347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33347

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19975410

Pulled By: ezyang

fbshipit-source-id: eb729870c2d279d7d9ca43c92e514fe38dedb06d
2020-02-20 09:36:26 -08:00
196fda5a79 Remove special case codegen for tril_indices/triu_indices. (#33305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33305

The current TensorOptions code is written to exactly extract out
TensorOptions based on exact struct match, including default arguments.
That meant that tril_indices/triu_indices which had a different
default argument didn't match, and thus needed a special case.

I resolve this special case by instead replacing the explicit long
default argument with a None default argument, and then adjusting
the actual implementations to select the correct dtype when none
was specified.  I think the general rule I'm following here is that
it is always acceptable to replace an explicit default argument
with a None argument (assuming the backend will compute it appropriately);
the documentation gets modestly worse, but everything that was
previously expressible continues to be expressible.  Maybe later
we should switch the default argument back to long, but for now
the simplification in code is worth it.
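
A minimal sketch of the resulting user-facing behavior:

```python
import torch

# No explicit dtype: the backend now selects torch.long itself,
# matching the old explicit default:
idx = torch.tril_indices(3, 3)
print(idx.dtype)    # torch.int64

# An explicit dtype is still honored as before:
idx32 = torch.tril_indices(3, 3, dtype=torch.int32)
print(idx32.dtype)  # torch.int32
```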

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19975411

Pulled By: ezyang

fbshipit-source-id: 996598759bed9e8d54fe61e19354ad038ed0e852
2020-02-20 09:34:28 -08:00
ffe327f7d9 Revert "Disable flaky test TestCppExtensionAOT.test_cuda_extension in… (#33404)
Summary:
… Windows CI (https://github.com/pytorch/pytorch/issues/33282)"

This reverts commit 5b922918d023126ad1f468c68577c9b599ad202d.

Fixes https://github.com/pytorch/pytorch/issues/33270.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33404

Differential Revision: D19972594

Pulled By: ezyang

fbshipit-source-id: c8f67536fd6e4b7135171d621ad671b1b2a21fd4
2020-02-20 09:08:29 -08:00
05fb160048 Revert D19964089: [pytorch][PR] Allow vectorized gpu loop to have different argument types
Test Plan: revert-hammer

Differential Revision:
D19964089

Original commit changeset: a1e8e62d1ebc

fbshipit-source-id: fee9423d5924714f0e92eea712cde2d2163b3cf0
2020-02-20 08:19:21 -08:00
883b18ea70 Delete build_variables.bzl following configerator change.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2020-02-20 10:26:49 -05:00
e95282ab28 [caffe2] make fused rowwise quant/dequant op work for N-dim tensors (#33426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33426

Make 2/4/8-bit fused rowwise conversion operators more general to work for N-dim tensors

Test Plan: CI

Reviewed By: ellie-wen

Differential Revision: D19943136

fbshipit-source-id: 47008544dd7e1d11a346d34f35449e0fcc0e7ee0
2020-02-19 23:29:42 -08:00
bf0951d937 Updating ONNX checker logic. (#33522)
Summary:
We want to run ONNX checker only when selected operator type is ONNX, and nowhere else. This PR updates the logic in the exporter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33522

Reviewed By: hl475

Differential Revision: D19983954

Pulled By: houseroad

fbshipit-source-id: 15db726321637a96fa110051cc54e9833e201133
2020-02-19 19:30:29 -08:00
1fe635be3c Allow vectorized gpu loop to have different argument types (#33222)
Summary:
Although currently the only user of GPU loops that has args with different dtypes is `where`, it sounds strange to restrict the args to have the same dtype. Allowing args to have different dtypes also makes it possible for me to clean up legacy code by reusing current code to implement unrolled GPU loop for non-contiguous tensors.

The stack storage of `elementwise_kernel_helper` is changed from `arg_t args[nt][arity]` to `traits::ArgsTuple args[nt]`. Due to this change, we can no longer get an element by `operator[]`; instead we use `std::get`. As a result, we can no longer unroll the loop w.r.t. arity using a pragma, and instead introduce a `static_unroll` that uses template metaprogramming to do the same job.

A good side effect of this change is that `invoke_with_array` is no longer needed and can be replaced with the already existing `c10::guts::apply`. We also don't need the `namespace arg_type` workaround anymore. This makes the code less ugly.

The same approach might also work for ROCm loops, but I didn't change anything on ROCm in this PR, because I don't want a potential compilation error or perf regression to delay it. After this gets merged, I will try it on ROCm and send a separate PR to make the code less divergent if the same approach trivially applies (meaning a mindless copy-paste doesn't introduce unexpected compilation errors or perf regressions).

Assembly (https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise-vec.ipynb#33222):
```
**Symbol:**
void at::native::modern::elementwise_kernel<4, 64, 4, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()https://github.com/pytorch/pytorch/issues/1}::operator()() const::{lambda()https://github.com/pytorch/pytorch/issues/4}::operator()() const::{lambda(float, float)https://github.com/pytorch/pytorch/issues/1}, at::detail::Array<char*, 3> >(int, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()https://github.com/pytorch/pytorch/issues/1}::operator()() const::{lambda()https://github.com/pytorch/pytorch/issues/4}::operator()() const::{lambda(float, float)https://github.com/pytorch/pytorch/issues/1}, at::detail::Array<char*, 3>)

**ASM:**

	.section	.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,"ax",progbits
	.sectioninfo	@"SHI_REGISTERS=20"
	.align	128
        .global         _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_
        .type           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,function
        .size           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,(.L_40520 - _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_)
        .other          _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,@"STO_CUDA_ENTRY STV_DEFAULT"
_ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 253
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R9, SR_CTAID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 39
        /*0030*/                   S2R R0, SR_TID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 253
        /*0040*/                   IMAD.SHL.U32 R9, R9, 0x100, RZ ;
        /*0050*/                   IADD3 R5, -R9, c[0x0][0x160], RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 227
        /*0060*/                   SHF.R.S32.HI R17, RZ, 0x1f, R9 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 255
        /*0070*/                   ISETP.GE.AND P0, PT, R5, 0x100, PT ;
        /*0080*/              @!P0 BRA `(.L_2919) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 227
        /*0090*/                   IMAD.SHL.U32 R12, R9.reuse, 0x4, RZ ;
        /*00a0*/                   SHF.L.U64.HI R17, R9, 0x2, R17 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 229
        /*00b0*/                   IADD3 R8, P0, R12.reuse, c[0x0][0x188], RZ ;
        /*00c0*/                   IADD3 R2, P1, R12, c[0x0][0x190], RZ ;
        /*00d0*/                   IADD3.X R9, R17.reuse, c[0x0][0x18c], RZ, P0, !PT ;
        /*00e0*/                   IADD3.X R3, R17, c[0x0][0x194], RZ, P1, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 82
        /*00f0*/                   IMAD.WIDE R8, R0, 0x10, R8 ;
        /*0100*/                   IMAD.WIDE R2, R0, 0x10, R2 ;
        /*0110*/                   LDG.E.128.SYS R8, [R8] ;
        /*0120*/                   LDG.E.128.SYS R4, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 227
        /*0130*/                   IADD3 R12, P0, R12, c[0x0][0x180], RZ ;
        /*0140*/                   IADD3.X R13, R17, c[0x0][0x184], RZ, P0, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 102
        /*0150*/                   IMAD.WIDE R12, R0, 0x10, R12 ;
	//## File "/usr/include/c++/8/tuple", line 1315
        /*0160*/                   FFMA R7, R7, c[0x0][0x168], R11 ;
        /*0170*/                   FFMA R6, R6, c[0x0][0x168], R10 ;
        /*0180*/                   FFMA R5, R5, c[0x0][0x168], R9 ;
        /*0190*/                   FFMA R4, R4, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 102
        /*01a0*/                   STG.E.128.SYS [R12], R4 ;
        /*01b0*/                   EXIT ;
.L_2919:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 42
        /*01c0*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
        /*01d0*/                   BMOV.32.CLEAR RZ, B0 ;
        /*01e0*/                   BSSY B0, `(.L_2920) ;
        /*01f0*/                   IMAD.MOV.U32 R4, RZ, RZ, RZ ;
        /*0200*/                   CS2R R6, SRZ ;
        /*0210*/                   IMAD.MOV.U32 R8, RZ, RZ, RZ ;
        /*0220*/                   IMAD.MOV.U32 R10, RZ, RZ, RZ ;
        /*0230*/               P0 BRA `(.L_2921) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 45
        /*0240*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0250*/                   LEA.HI.X.SX32 R6, R0, R17, 0x1, P1 ;
        /*0260*/                   LEA R2, P1, R3, c[0x0][0x188], 0x2 ;
        /*0270*/                   LEA.HI.X R3, R3, c[0x0][0x18c], R6, 0x2, P1 ;
        /*0280*/                   LDG.E.SYS R10, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 46
        /*0290*/                   IADD3 R6, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 42
        /*02a0*/                   ISETP.GE.AND P1, PT, R6, R5, PT ;
        /*02b0*/               P1 BRA `(.L_2922) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 45
        /*02c0*/                   LDG.E.SYS R6, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 46
        /*02d0*/                   IADD3 R8, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 42
        /*02e0*/                   ISETP.GE.AND P1, PT, R8, R5, PT ;
        /*02f0*/               P1 BRA `(.L_2923) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 46
        /*0300*/                   IADD3 R8, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 42
        /*0310*/                   ISETP.GE.AND P1, PT, R8, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 45
        /*0320*/                   LDG.E.SYS R8, [R2+0x200] ;
        /*0330*/              @!P1 LDG.E.SYS R7, [R2+0x300] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 102
        /*0340*/               P1 IMAD.MOV.U32 R7, RZ, RZ, RZ ;
        /*0350*/                   BRA `(.L_2921) ;
.L_2923:
        /*0360*/                   IMAD.MOV.U32 R7, RZ, RZ, RZ ;
        /*0370*/                   IMAD.MOV.U32 R8, RZ, RZ, RZ ;
        /*0380*/                   BRA `(.L_2921) ;
.L_2922:
        /*0390*/                   CS2R R6, SRZ ;
        /*03a0*/                   IMAD.MOV.U32 R8, RZ, RZ, RZ ;
.L_2921:
        /*03b0*/                   BSYNC B0 ;
.L_2920:
        /*03c0*/                   BMOV.32.CLEAR RZ, B0 ;
        /*03d0*/                   BSSY B0, `(.L_2924) ;
        /*03e0*/               P0 BRA `(.L_2925) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 45
        /*03f0*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0400*/                   LEA.HI.X.SX32 R12, R0, R17, 0x1, P1 ;
        /*0410*/                   LEA R2, P1, R3, c[0x0][0x190], 0x2 ;
        /*0420*/                   LEA.HI.X R3, R3, c[0x0][0x194], R12, 0x2, P1 ;
        /*0430*/                   LDG.E.SYS R11, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 46
        /*0440*/                   IADD3 R12, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 42
        /*0450*/                   ISETP.GE.AND P1, PT, R12, R5, PT ;
        /*0460*/               P1 BRA `(.L_2926) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 45
        /*0470*/                   LDG.E.SYS R13, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 46
        /*0480*/                   IADD3 R12, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 42
        /*0490*/                   ISETP.GE.AND P1, PT, R12, R5, PT ;
        /*04a0*/               P1 BRA `(.L_2927) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 45
        /*04b0*/                   LDG.E.SYS R15, [R2+0x200] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 46
        /*04c0*/                   IADD3 R12, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 42
        /*04d0*/                   ISETP.GE.AND P1, PT, R12, R5, PT ;
        /*04e0*/               P1 BRA `(.L_2928) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 45
        /*04f0*/                   LDG.E.SYS R4, [R2+0x300] ;
        /*0500*/                   BRA `(.L_2928) ;
.L_2927:
        /*0510*/                   IMAD.MOV.U32 R15, RZ, RZ, RZ ;
        /*0520*/                   BRA `(.L_2928) ;
.L_2926:
        /*0530*/                   IMAD.MOV.U32 R15, RZ, RZ, RZ ;
        /*0540*/                   IMAD.MOV.U32 R13, RZ, RZ, RZ ;
        /*0550*/                   BRA `(.L_2928) ;
.L_2925:
        /*0560*/                   IMAD.MOV.U32 R15, RZ, RZ, RZ ;
        /*0570*/                   IMAD.MOV.U32 R13, RZ, RZ, RZ ;
        /*0580*/                   IMAD.MOV.U32 R11, RZ, RZ, RZ ;
.L_2928:
        /*0590*/                   BSYNC B0 ;
.L_2924:
	//## File "/usr/include/c++/8/tuple", line 1315
        /*05a0*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 58
        /*05b0*/                   IADD3 R9, P0, R9, R0, RZ ;
	//## File "/usr/include/c++/8/tuple", line 1315
        /*05c0*/                   FFMA R11, R11, c[0x0][0x168], R10 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 59
        /*05d0*/                   IADD3 R14, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 58
        /*05e0*/                   LEA.HI.X.SX32 R12, R0, R17, 0x1, P0 ;
        /*05f0*/                   LEA R2, P0, R9.reuse, c[0x0][0x180], 0x2 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 55
        /*0600*/                   ISETP.GE.AND P1, PT, R14, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 58
        /*0610*/                   LEA.HI.X R3, R9, c[0x0][0x184], R12, 0x2, P0 ;
        /*0620*/                   STG.E.SYS [R2], R11 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 55
        /*0630*/               P1 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 59
        /*0640*/                   IADD3 R10, R0, 0x80, RZ ;
	//## File "/usr/include/c++/8/tuple", line 1315
        /*0650*/                   FFMA R13, R13, c[0x0][0x168], R6 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 55
        /*0660*/                   ISETP.GE.AND P0, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 58
        /*0670*/                   STG.E.SYS [R2+0x100], R13 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 55
        /*0680*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 59
        /*0690*/                   IADD3 R0, R0, 0xc0, RZ ;
	//## File "/usr/include/c++/8/tuple", line 1315
        /*06a0*/                   FFMA R15, R15, c[0x0][0x168], R8 ;
        /*06b0*/                   FFMA R7, R4, c[0x0][0x168], R7 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 55
        /*06c0*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 58
        /*06d0*/                   STG.E.SYS [R2+0x200], R15 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 55
        /*06e0*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 58
        /*06f0*/                   STG.E.SYS [R2+0x300], R7 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh", line 260
        /*0700*/                   EXIT ;
.L_2929:
        /*0710*/                   BRA `(.L_2929);
        /*0720*/                   NOP;
        /*0730*/                   NOP;
        /*0740*/                   NOP;
        /*0750*/                   NOP;
        /*0760*/                   NOP;
        /*0770*/                   NOP;
.L_40520:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33222

Differential Revision: D19964089

Pulled By: ngimel

fbshipit-source-id: a1e8e62d1ebcc67fb49f00d87c02bcdd13194024
2020-02-19 18:41:27 -08:00
81394581a3 [Caffe2][ThreadPool] Make sure numThreads does not exceed the number of big cores (#33523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33523

When using `ThreadPool::setNumThreads` to set the number of threads, it should not exceed the number of big cores. Otherwise, the performance could degrade significantly.

Test Plan:
```
cd ~/fbsource/xplat
buck test caffe2:caffe2_testAndroid
```

Reviewed By: dreiss

Differential Revision: D19779267

fbshipit-source-id: 4e980e8a0ccc2f37e1c8ed16e2f4651d72924dbd
2020-02-19 18:24:24 -08:00
602ef0d9d0 [WIP] migrate scatter_ to ATen CPU (+multithreading, nondeterministic) (#33139)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24757, partially https://github.com/pytorch/pytorch/issues/33094. Uses the fix introduced in https://github.com/pytorch/pytorch/issues/33108 to avoid regressions with some compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33139

Differential Revision: D19882462

Pulled By: ngimel

fbshipit-source-id: 5016f186a4aadc3cc32edcfd9abdea11786f27e9
2020-02-19 18:17:37 -08:00
6cb9e6b015 Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434

Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
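
A minimal sketch of how the timeout is supplied (the address and port are illustrative); after this change it propagates to the underlying TCPStore instead of being dropped:

```python
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
    timeout=timedelta(seconds=30),  # now reaches the TCPStore
)
```
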
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
2020-02-19 17:17:17 -08:00
ecb05f12c3 Support broadcast for quantized mul kernel (#30442)
Summary:
Since the tensor iterator supports broadcasting, we just remove the assertion on input shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30442
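
A minimal sketch of the newly allowed broadcasting (the output quantization parameters are illustrative):

```python
import torch

a = torch.quantize_per_tensor(torch.randn(2, 3), scale=0.1,
                              zero_point=0, dtype=torch.quint8)
b = torch.quantize_per_tensor(torch.randn(1, 3), scale=0.1,
                              zero_point=0, dtype=torch.quint8)

# Shapes (2, 3) and (1, 3) broadcast; the old shape assertion is gone:
out = torch.ops.quantized.mul(a, b, 0.1, 0)  # (qa, qb, scale, zero_point)
print(out.shape)  # torch.Size([2, 3])
```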

Differential Revision: D19976562

Pulled By: lly-zero-one

fbshipit-source-id: 91b27fc8b2570f29d110c6df26eacdd16f587b9f
2020-02-19 16:52:31 -08:00
ea514c819a Make slow_conv_transpose2d_backward tensors contiguous (#33462)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33462

Test Plan: Imported from OSS

Differential Revision: D19956516

Pulled By: VitalyFedyunin

fbshipit-source-id: 4fa9dcba0dd02b891ab36e6ecee8fc59e049c15c
2020-02-19 16:44:14 -08:00
e5a02aa2fe [caffe2] simplify relative error expr (#32999)
Summary:
simplify relative error expr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32999

Differential Revision: D19739382

Pulled By: jerryzh168

fbshipit-source-id: 95e0c68f6d9cb6708f400cc1cdb311af83b0621e
2020-02-19 16:35:44 -08:00
bd3c6e8e91 avoid large vector copy when query per_channel q_params (#31040)
Summary:
The quantizer uses std::vector to store per-channel scales and zero_points, but queries for scales (zero_points) must return tensors. As a result, tensors are initialized from std::vector on every query, which costs a lot of time. So I changed the quantizer to store per-channel scales and zero_points as tensors directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31040

Differential Revision: D19701070

Pulled By: jerryzh168

fbshipit-source-id: 9043f16c44b74dd8289b8474e540171765a7f92a
2020-02-19 16:24:24 -08:00
8527ba8b70 [jit] Add None parameter as parameter instead of attributes (#32964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32964

att

Test Plan:
.

Imported from OSS

Differential Revision: D19913188

fbshipit-source-id: 9cdd93cbaf9892f4311656c786637765a675a68c
2020-02-19 16:06:56 -08:00
507f963aa6 [RPC Reliability] Enabled retries for RPCs with exponential backoff (#33365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33365

This adds functionality for retrying RPCs that are sent with the function sendWithRetries(). It adds RPCs that will potentially need to be retried to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retryable RPC from the map, sleeps until the corresponding time point, retries the RPC, and adds it to the map again with a future timeout.

GitHub Issue: https://github.com/pytorch/pytorch/issues/32124

Per the first 4 milestones, the following will be addressed in future PRs:
* enabling RPC retries for RRef internal messages

Differential Revision: D19915694

fbshipit-source-id: 4a520e32d5084ebcf90e97fd9f26867115a35c0c
2020-02-19 15:59:29 -08:00
416413dec4 [jit] add inlined_graph method to ScriptFunctions (#33508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33508

Ever since we switched to not inlining by default, some users have
complained, since they relied on inlining occurring to, e.g., process the
graph with some other tool. Add an inlined_graph method for convenience in
those cases.
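
A minimal sketch of the convenience this adds:

```python
import torch

@torch.jit.script
def inner(x: torch.Tensor) -> torch.Tensor:
    return x + 1

@torch.jit.script
def outer(x: torch.Tensor) -> torch.Tensor:
    return inner(x) * 2

print(outer.graph)          # the call to inner stays a CallFunction node
print(outer.inlined_graph)  # calls are inlined, as older tooling expects
```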

Test Plan: Imported from OSS

Differential Revision: D19977638

Pulled By: suo

fbshipit-source-id: fe1fa92ff888959203d5d1995930d488b5f9e24c
2020-02-19 15:41:25 -08:00
5e80ca12bb [pt][fbgemm] Turn on USE_FBGEMM on Windows env (#297)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33250

As Title says. FBGEMM has recently added the support for Windows.

ghstack-source-id: 97932881

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D19738268

fbshipit-source-id: e7f3c91f033018f6355edeaf6003bd2803119df4
2020-02-19 15:09:21 -08:00
cbf8657945 [jit] Fix ModuleDict type sharing (#33515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33515

Previously, if we had a `ModuleDict` with the same value types but
different names for keys, they would share types under certain
conditions. This only happens for `ModuleDict`, because in other cases
a simple Python class check invalidates the class.

Test Plan: Imported from OSS

Differential Revision: D19978552

Pulled By: suo

fbshipit-source-id: f31b2af490064f89b70aa35f83ba740ddaf2a77a
2020-02-19 15:01:46 -08:00
8908b62fb2 Clean views created inside no_grad that are modified inplace (#32839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32839

As mentioned in the updated comment in `variable.h`, this disambiguate code like:
```python
base = torch.rand(10, requires_grad=True)
with torch.no_grad():
    view = base[1]
view.copy_(var)
torch.autograd.grad(base.sum(), var)  # <- what should it return?
```
Given that there is no consensus on what should happen here (does the gradient flow through the view created in no_grad or not?), this special case is detected and forbidden.
As mentioned in the error message:
- If you want it to be tracked: move both operations out of the no_grad block
- If you do not want them to be tracked: move both inside the no_grad block

This implies that any custom Function that returns views does not allow inplace modification of its output. I'll add a PR to the stack to relax this to a DeprecationWarning for now, and we will make it an actual error in 1.6.

This replaces https://github.com/pytorch/pytorch/pull/26607
cc sublee

Test Plan: Imported from OSS

Differential Revision: D19814114

Pulled By: albanD

fbshipit-source-id: ff2c9d97c8f876d9c31773a2170e37b06d88bed7
2020-02-19 14:55:53 -08:00
20c1e25832 Re-sync with internal repository (#33519) 2020-02-19 14:33:44 -08:00
1d9fcf8bd2 Correct documentation for torch.unsqueeze (#33478)
Summary:
"out" argument in torch.unsqueeze is not actually implemented, fixed documentation https://github.com/pytorch/pytorch/issues/29800
After: ![image](https://user-images.githubusercontent.com/33493903/74796371-6289ee00-5296-11ea-8493-e8c18ac63bdf.png)

Before: ![image](https://user-images.githubusercontent.com/33493903/74796444-96651380-5296-11ea-816c-2adacfa79e35.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33478

Differential Revision: D19978477

Pulled By: yf225

fbshipit-source-id: 42337326c1ec04975307366c94591ee32a11b091
2020-02-19 14:01:06 -08:00
62c953b348 Fix svd tests between devices. (#33470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33470

Differential Revision: D19974449

Pulled By: ailzhang

fbshipit-source-id: e456608fe95d270d822e786a5955cce7c746165c
2020-02-19 13:53:10 -08:00
a8bd1d24c9 [Documentation] cummin doc fix (#33492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33492

Differential Revision: D19976082

Pulled By: anjali411

fbshipit-source-id: c9f8f541783fded98b8aba54e293f824c926496e
2020-02-19 13:51:38 -08:00
d4e4513a64 [JIT] Add more ops to 'removableGuard' in guard elimination pass. (#33465)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33465

Differential Revision: D19958385

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: f89b6a2ead279b55af286072223fc9ea1b5fe3b3
2020-02-19 11:47:23 -08:00
07e5e42713 [jit][fix] Remove slot in parameter slot (#32846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32846

att

Test Plan:
build/bin/test_jit

Imported from OSS

Differential Revision: D19844711

fbshipit-source-id: 3d29e5e97e97781f5dc00069827971baed52d76e
2020-02-19 11:15:15 -08:00
1e3664b6ef Remove c/pdist tests from _internal/common_utils.py (#33409)
Summary:
* remove brute_test from `torch/testing/_internal/common_utils.py`
* add these tests as internal tests to `test_torch.py`

CC ailzhang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33409

Differential Revision: D19951729

Pulled By: ailzhang

fbshipit-source-id: b1126aaf26fa64a0f17cbb582dc8038b79cfe3eb
2020-02-19 10:27:30 -08:00
60339a38ed Fixes #33001 (#33456)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/33001.

When subtracting 1 from an empty array, instead of becoming `-1` as the later code (the while loop) expects, the value becomes a very large number because `size()` is unsigned. This causes a segfault later, when the while loop tries to index into the empty array.

This issue seemed to happen only on the Pi, with the following example code: `v = torch.FloatTensor(1, 135).fill_(0); v[0, [1]] += 2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33456

Differential Revision: D19963711

Pulled By: ezyang

fbshipit-source-id: 1dbddd59a5df544cd7e025fc540c9efe2c4e19f4
2020-02-19 09:57:52 -08:00
165b1ad8e8 Kill THCState_getNumDevices (#33375)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33375

Differential Revision: D19973163

Pulled By: ezyang

fbshipit-source-id: d8edede3a3ac5012e4208bb30b6e66d8a2d1019f
2020-02-19 09:52:40 -08:00
96e5dea9f4 Remove unused variable (#33484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33484

att

Test Plan: unittests

Reviewed By: jfix71

Differential Revision: D19862090

fbshipit-source-id: c6a33604e2fc78fb90ae2b5fcc72421ee89a02aa
2020-02-19 08:51:56 -08:00
d7f00b1b45 Remove using declaration from widely-used header file. (#33293)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33293

Test Plan: Imported from OSS

Differential Revision: D19904992

Pulled By: gchanan

fbshipit-source-id: b5ac76db2e5cdb422671c6c5424858e1d97c323e
2020-02-19 08:19:11 -08:00
a67691e508 Fix isnan for integral types in MSVC (#33483)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/32537#discussion_r381077989.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33483

Differential Revision: D19970623

Pulled By: anjali411

fbshipit-source-id: 53502101822672a333ab5349d93b6e93f7ee4265
2020-02-19 08:13:03 -08:00
53ad596342 [jit] Remove `torch.jit._dump_trace (#33453)
Summary:
This was old code that isn't tested and is broken, it should have been
deleted in #24874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33453

Pulled By: driazati

Differential Revision: D19961403

fbshipit-source-id: 94c52360460194d279dad5b0ea756ee366f525e1
2020-02-19 07:49:44 -08:00
8b6a898d2b Updating submodules
Summary:
GitHub commits:

d9ead2de34

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 6c245f2a656d30b7baf8d0bff85a49090174c289
2020-02-19 05:09:56 -08:00
d13c1b8af8 [jit] de-optionalize SourceRange context (#32880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32880

The PR below made it impossible to construct a SourceRange without a
context, so get rid of its optional-ness

Test Plan: Imported from OSS

Differential Revision: D19670923

Pulled By: suo

fbshipit-source-id: 05936fca2a3d5e613313ade9287b2210bc4a3ccd
2020-02-18 23:46:05 -08:00
d85c913bfd [jit] Delete the ErrorReport default constructor (#32879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32879

An error report without a SourceRange context is bad, because it doesn't
tell the user where something happened. Delete the default constructor
to make it harder to create errors like this (you can still use a fake
SourceRange if you absolutely need to).

Also clean up the only case where the default constructor was used.

Test Plan: Imported from OSS

Differential Revision: D19670924

Pulled By: suo

fbshipit-source-id: 46888a86e5d32b84c8d6d52c0c8d70243722b14a
2020-02-18 23:44:32 -08:00
e9ac92a242 Make RPC message constructor actually move (#33440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33440

The constructors make a copy without `std::move` in the initializer list.

Test Plan:
Confirmed manually that without this change, the `data()` pointer of
the vector changes. With this change it does not, as intended.

Reviewed By: mrshenli

Differential Revision: D19948685

fbshipit-source-id: ee4f22e29894b858ad86068722dc2f4651987517
2020-02-18 23:31:33 -08:00
d50305e2f3 Updating submodules
Summary:
GitHub commits:

7903fc3142
462eaef5fc
e2966a7507
09013ed8c4
df7e47c39b
f40e6d1dbf

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 37553007eb60438d5ddd9cb16f0edc24e4637c25
2020-02-18 23:27:08 -08:00
a5f01846c2 Kill THCState_getCurrentStream (#33376)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33376

Differential Revision: D19964101

Pulled By: ngimel

fbshipit-source-id: d6b76327191a469f3a88a54d8ffe07121139ab16
2020-02-18 21:24:27 -08:00
96989a2a11 [ONNX] Adding ONNX large model export support in exporter (#33062)
Summary:
There are large models such as GPT2-large which cannot be exported with the current exporter because of the 2GB protobuf limit (e.g. see https://github.com/pytorch/pytorch/issues/19277). The ONNX spec defines a special format for large (> 2GB) models. This PR adds support for exporting large models in the ONNX large-model format in the PyTorch-ONNX exporter.

This is the first PR for this feature that enables the end-to-end execution. Tests for large model export have been added. We may need follow-up PRs to refine this workflow based on user feedback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33062

Reviewed By: hl475

Differential Revision: D19782292

Pulled By: houseroad

fbshipit-source-id: e972fcb066065cae6336aa91c03023d9c41c88bd
2020-02-18 20:51:43 -08:00
3ad59734d7 Add type annotation for bias in _ConvNd (#32885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32885

Currently a Tensor bias is registered as a parameter and a None bias is registered as an attribute.
We need the type annotation because when we try to fold ConvBn in graph mode quantization we'll
remove the None bias attribute and add a Tensor bias attribute. Without the type annotation, the
bias Value in the graph would be marked with a different type in these two cases, so we would have
to rewrite the graph to change the type as well. With the type annotation we don't need to modify
the graph, since in both cases the bias value has type `Tensor?`.

Test Plan:
.

Imported from OSS

Differential Revision: D19844710

fbshipit-source-id: 52438bc72e481ab78560533467f9379a8b0b0cfa
2020-02-18 20:09:18 -08:00
feaa622fc6 [Update transforms.py]Add TanhTransform (#19785)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/33195
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19785
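
A minimal usage sketch of the new transform (a tanh-squashed Gaussian, e.g. for bounded actions in RL):

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

base = Normal(torch.zeros(3), torch.ones(3))
dist = TransformedDistribution(base, [TanhTransform()])

sample = dist.rsample()       # values lie in (-1, 1)
logp = dist.log_prob(sample)  # the transform supplies a stable log|det J|
print(sample, logp)
```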

Differential Revision: D19642395

Pulled By: ezyang

fbshipit-source-id: 73c386fb89cd195201757b5fa47d6c01914a1f8f
2020-02-18 17:42:10 -08:00
43e015f4b1 Bug fix in dynamic quantization kernels + better test coverage. (#33320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33320

Reviewed By: supriyar

Differential Revision: D19893911

Pulled By: AshkanAliabadi

fbshipit-source-id: e79dd06af333c6629e3412315550814da28d9c24
2020-02-18 15:32:44 -08:00
f1b73799d5 Clean up isinstance flags (#33265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33265

This removes the need for isinstance to keep track of list and tuple
separately by introducing AnyListType and AnyTupleType into the JIT
type system as the common supertypes of all lists and tuples.

This allows us to remove the weird flags from the interpreter for
the isinstance operator.

Test Plan: Imported from OSS

Differential Revision: D19883933

Pulled By: zdevito

fbshipit-source-id: f998041b42d8b4554c5b99f4d95d1d42553c4d81
2020-02-18 15:07:06 -08:00
7f2c25b6fa Move special ops into interpreter (#32889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32889

Common primitive ops that have special inputs make it very hard to
serialize the bytecode for mobile because information about how the
op behaves is hidden in the Node*. This changes how we handle the following
ops so that they are encoded as their own interpreter bytecodes.

```
    USES NODE: prim::TupleUnpack(...) -> (...)
    USES NODE: prim::TupleSlice(...) -> (...)
    USES NODE: prim::TupleConstruct(...) -> (...)
    USES NODE: prim::ListUnpack(...) -> (...)
    USES NODE: prim::ListConstruct(...) -> (...)
    USES NODE: prim::DictConstruct(...) -> (...)
    USES NODE: prim::Constant() -> (...)
    USES NODE: prim::isinstance(...) -> (...)
    USES NODE: prim::CreateObject(...) -> (...)
    USES NODE: prim::fork(...) -> (...)
    USES NODE: aten::warn(str message, *, int stacklevel=2) -> () # need stack level information, so ideally in interpreter so it can look at the stack
```

This leaves a state where the _only_ remaining Node*-consuming builtins
are things that are only introduced during JIT optimization and will
not appear in mobile code.

Serialization of bytecode can now be made to directly write the CodeImpl
object without modification.

Test Plan: Imported from OSS

Differential Revision: D19673157

Pulled By: zdevito

fbshipit-source-id: 7b8c633d38a4c783b250fbdb222705e71a83ad26
2020-02-18 15:07:01 -08:00
83c347ff4a Remove prim::Constant op (#32804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32804

Constants are interpreter primitives so the op was not actually used.
This cleans up some of the logic around it.

This also fixes constant prop such that failures to look up an op
do not silently stop constant propagation. Instead, only errors
inside the op implementation itself will do this.

Test Plan: Imported from OSS

Differential Revision: D19673156

Pulled By: zdevito

fbshipit-source-id: 7beee59a6a67a6c2f8261d86bd505280fefa999e
2020-02-18 15:06:56 -08:00
c59e35b147 interpreter handling for varargs to remove need for looking at Node (#32791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32791

When a registered operator has varags (ends with ... in its schema),
the interpreter now appends the number of arguments to the top of
the stack before invoking the operator. This allows the removal of more
uses of Node* in the interpreter.

This PR also then cleans up the constructors for Operator to make
it more likely someone chooses the correct one. After making these ops:

```
USES NODE: prim::TupleUnpack(...) -> (...)
USES NODE: prim::TupleSlice(...) -> (...)
USES NODE: prim::TupleConstruct(...) -> (...)
USES NODE: prim::ListUnpack(...) -> (...)
USES NODE: prim::ListConstruct(...) -> (...)
USES NODE: prim::DictConstruct(...) -> (...)
USES NODE: prim::Constant() -> (...)
USES NODE: prim::isinstance(...) -> (...)
USES NODE: prim::CreateObject(...) -> (...)
USES NODE: prim::fork(...) -> (...)
USES NODE: aten::warn(str message, *, int stacklevel=2) -> () # need stack level information, so ideally in interpreter so it can look at the stack
```

Into interpreter primitives, we can remove all but two constructors for operators:
one that is (schema_string, operation), and one that is (symbol, op_creator) for
the remaining weird primitives.

Test Plan: Imported from OSS

Differential Revision: D19673158

Pulled By: zdevito

fbshipit-source-id: 95442a001538a6f53c1db4a210f8557ef118de66
2020-02-18 15:04:48 -08:00
da015c77a1 Cummax and Cummin doc update and performance benchmark (#32537)
Summary:
Benchmark results for cummax/cummin, first on CUDA (`x`), then on CPU (`y`):

```
In [1]: import torch

In [2]: x=torch.randn(5,6,7).cuda()

In [3]: %timeit x.cummax(0)
134 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [4]: %timeit x.max(0)
114 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit x.cummax(1)
134 µs ± 760 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit x.max(1)
118 µs ± 514 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: %timeit x.cumsum(0)
97.1 µs ± 6.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [8]: %timeit x.cumprod(0)
83.6 µs ± 689 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]: %timeit x.cumprod(1)
86.3 µs ± 528 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [10]: y=torch.randn(5,6,7)

In [11]: %timeit y.cummax(0)
148 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit y.max(0)
111 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit y.cumsum(0)
54.8 µs ± 311 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: %timeit y.cumprod(0)
56.2 µs ± 836 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32537

Differential Revision: D19951171

Pulled By: anjali411

fbshipit-source-id: cf972c550189473e9ce62e24ac7dd34b9373fef9
2020-02-18 14:12:25 -08:00
016d73bd74 remove Complex CPU/CUDA backend enum keys (#33267)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33267

Test Plan: Imported from OSS

Differential Revision: D19907696

Pulled By: anjali411

fbshipit-source-id: 78cc55344313387c4b05bb003688915cee64e3be
2020-02-18 13:38:39 -08:00
1d743e3154 Add guard elimination support for aten::unsqueeze. (#33371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33371

Differential Revision: D19920041

Pulled By: resistor

fbshipit-source-id: 906af47676dba014c31eef069a4753207f2efc60
2020-02-18 13:22:58 -08:00
1af30451e5 sync srcs between fbcode and ovrsource targets (#33368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33368

reorganizing files that describe sources to ensure the same list is used for both fbcode and ovrsource targets. (BUCK vs TARGETS)

Test Plan: CI green

Reviewed By: malfet

Differential Revision: D19803036

fbshipit-source-id: 69c1fa10877c3f0c0e9c1517784949c3c9939710
2020-02-18 13:00:43 -08:00
44af8ee6cd Add pybind11 exception translator (#30588)
Summary:
Closes https://github.com/pytorch/pytorch/issues/30027

The idea here is that you can bind a function with `pybind11` in a single line and without modifying the function:
```cpp
m.def("foo", foo, py::call_guard<torch::PyWarningHandler>());
```
Warnings are handled by the [`call_guard`](https://pybind11.readthedocs.io/en/stable/advanced/functions.html#call-guard) and exceptions are handled by the `pybind11` exception translator. To do this, I have added support for handling C++ exceptions in `torch::PyWarningHandler`'s destructor without setting the Python error state beforehand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30588

Differential Revision: D19905626

Pulled By: albanD

fbshipit-source-id: 90c0a5e298b123cc0c8ab9c52c91be4e96ea47c6
2020-02-18 11:33:29 -08:00
4c8064c9e1 Fix avx-512 detection logic for jit fuser with MSVC 2019 (#33403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33401.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33403

Differential Revision: D19949812

Pulled By: soumith

fbshipit-source-id: 00dc3c99b5ba1c13394d5d38bcb148720434b0a3
2020-02-18 11:04:18 -08:00
abbf6e7f53 fix clang-tidy lint (#33448)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33448

Test Plan: Imported from OSS

Differential Revision: D19952962

Pulled By: suo

fbshipit-source-id: db04bf74f6156edd1bd0716b12f6ca911c84a6bf
2020-02-18 11:02:57 -08:00
4468a7b7b3 Updating submodules
Summary:
GitHub commits:

efc34423b6
75bb459654
fc1945c2e0
332a31a145
2b6eef4dc9

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: d105b9aa5001c53f884f007406684b73809a7680
2020-02-18 10:21:04 -08:00
f938b3b4e0 Remove TH binding of set_(Tensor). (#33358)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33358

We just translate this code to ATen.

Test Plan: Imported from OSS

Differential Revision: D19911114

Pulled By: gchanan

fbshipit-source-id: 2279e63bb7006f7253620417937e3ce9301e0cdb
2020-02-18 10:10:00 -08:00
879cf0b15a fix typing bug of LambdaLR.__init__ (#33271)
Summary:
## problem

```python
class LambdaLR(_LRScheduler):
    """Sets the learning rate of each parameter group to the initial lr
    times a given function. When last_epoch=-1, sets initial lr as lr.

    Args:
        optimizer (Optimizer): Wrapped optimizer.
        lr_lambda (function or list): A function which computes a multiplicative
            factor given an integer parameter epoch, or a list of such
            functions, one for each group in optimizer.param_groups.
        last_epoch (int): The index of last epoch. Default: -1.

    Example:
        >>> # Assuming optimizer has two groups.
        >>> lambda1 = lambda epoch: epoch // 30
        >>> lambda2 = lambda epoch: 0.95 ** epoch
        >>> scheduler = LambdaLR(optimizer, lr_lambda=[lambda1, lambda2])
        >>> for epoch in range(100):
        >>>     train(...)
        >>>     validate(...)
        >>>     scheduler.step()
    """
```

`LambdaLR` takes a lambda that takes an int and returns a float, or a list of such lambdas.

## related issue

Resolve https://github.com/pytorch/pytorch/issues/32645
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33271

Differential Revision: D19878665

Pulled By: vincentqb

fbshipit-source-id: 50b16caea13de5a3cbd187e688369f33500499d0
2020-02-18 09:10:00 -08:00
2c99ea8654 Dirac init compatibility with group convolutions (#32825)
Summary:
Initializing the weights of a grouped convolution with init.dirac_ and applying it previously resulted in an output that makes no sense:
```
x = torch.randn([1, 3, 3, 3])
print('input:\n', x)
conv_layer = torch.nn.Conv2d(3, 3, 3, padding=1, groups=3, bias=False)
torch.nn.init.dirac_(conv_layer.weight.data)
print('\noutput (before this PR):\n',conv_layer(x))

input:
 tensor([[[[ 0.5369, -1.1428,  0.1031],
          [ 0.4638, -0.0854, -0.6553],
          [ 0.8321, -2.5926, -0.3214]],

         [[-0.2289, -0.0895,  0.4407],
          [ 1.2309, -1.2096, -1.5216],
          [-0.1798,  1.1694,  0.3469]],

         [[ 0.1905,  0.8095,  0.5490],
          [-0.4525, -0.4284, -0.1141],
          [ 1.1857, -0.9246, -0.5119]]]])

output (before this PR):
 tensor([[[[ 0.5369, -1.1428,  0.1031],
          [ 0.4638, -0.0854, -0.6553],
          [ 0.8321, -2.5926, -0.3214]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]]]], grad_fn=<MkldnnConvolutionBackward>)
````

This PR allows introducing groups to the initialization:
```
torch.nn.init.dirac_(conv_layer.weight.data, groups=3)
print('output (after this PR):\n', conv_layer(x))

output (after this PR):
 tensor([[[[ 0.5369, -1.1428,  0.1031],
          [ 0.4638, -0.0854, -0.6553],
          [ 0.8321, -2.5926, -0.3214]],

         [[-0.2289, -0.0895,  0.4407],
          [ 1.2309, -1.2096, -1.5216],
          [-0.1798,  1.1694,  0.3469]],

         [[ 0.1905,  0.8095,  0.5490],
          [-0.4525, -0.4284, -0.1141],
          [ 1.1857, -0.9246, -0.5119]]]], grad_fn=<MkldnnConvolutionBackward>)
```

When out_channels differs from in_channels, it does the natural thing, which is to apply the identity in each group separately:

```
x = torch.randn([1, 2, 3, 3])
print('input:\n', x)
conv_layer = torch.nn.Conv2d(2, 4, 3, padding=1, groups=2, bias=False)
torch.nn.init.dirac_(conv_layer.weight.data, groups=2)
print('\noutput:\n', conv_layer(x))

input:
 tensor([[[[ 1.2205, -0.6608,  0.8640],
          [-0.5464,  1.1288,  1.4726],
          [-0.6693,  0.4000, -1.7613]],

         [[-0.8760, -0.8814, -0.4705],
          [ 0.6283, -0.5943,  0.6873],
          [-0.6852,  1.4723,  0.3325]]]])

output:
 tensor([[[[ 1.2205, -0.6608,  0.8640],
          [-0.5464,  1.1288,  1.4726],
          [-0.6693,  0.4000, -1.7613]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]],

         [[-0.8760, -0.8814, -0.4705],
          [ 0.6283, -0.5943,  0.6873],
          [-0.6852,  1.4723,  0.3325]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]]]], grad_fn=<MkldnnConvolutionBackward>)
```

The 'groups' argument defaults to 1, so the change is backward compatible.

Tests are modified to include cases with groups > 1 while still covering groups = 1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32825

Differential Revision: D19859926

Pulled By: vincentqb

fbshipit-source-id: 9dfdd24471ff14d79c442dfd28c1891aff812fdf
2020-02-18 09:00:12 -08:00
28c5213a97 Add mechanism to pass a number of workers to cpp extensions (#33346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33346

Fixes #33091

This PR lets users control the number of workers that cpp extensions
uses through the environment variable `MAX_JOBS`. If the environment
variable is a non-negative integer we use that many threads; otherwise,
ninja falls back to the default.

I chose to use the name `MAX_JOBS` because we use it in PyTorch already
to control the number of workers PyTorch builds with. There is a risk
that users of cpp extensions already have `MAX_JOBS` set but we are
hoping that that risk is small and/or it means semantically the same
thing.
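
A minimal sketch of the knob in use ("my_ext" and my_ext.cpp are placeholders, not real files):

```python
import os

# Cap ninja at 4 parallel compile jobs before triggering a build:
os.environ["MAX_JOBS"] = "4"

from torch.utils.cpp_extension import load

ext = load(name="my_ext", sources=["my_ext.cpp"])
```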

Test Plan: - tested locally

Differential Revision: D19911645

Pulled By: zou3519

fbshipit-source-id: d20ed42de4f845499ed38f1a1c73e9ccb620f780
2020-02-18 06:48:11 -08:00
cfb4862673 [pytorch] correct input size check for GroupNorm (#33008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33008

Corrects D19373507 to allow valid use cases that fail now. Multiplies batch size by the number of elements in a group to get the correct number of elements over which statistics are computed.

**Details**:
The current implementation disallows GroupNorm to be applied to tensors of shape e.g. `(1, C, 1, 1)` to prevent cases where statistics are computed over 1 element and thus result in a tensor filled with zeros.
However, in GroupNorm the statistics are calculated across channels. So in a case where one has an input tensor of shape `(1, 256, 1, 1)` for `GroupNorm(32, 256)`, the statistics will be computed over 8 elements and thus be meaningful.

One use case is [Atrous Spatial Pyramid Pooling (ASPPPooling)](791c172a33/torchvision/models/segmentation/deeplabv3.py (L50)), where GroupNorm could be used in place of BatchNorm [here](791c172a33/torchvision/models/segmentation/deeplabv3.py (L55)). However, now this is prohibited and results in failures.

The proposed solution corrects the computation of the number of elements over which statistics are computed: the number of elements per group is folded into the batch size.
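
A minimal sketch of the case this unblocks:

```python
import torch

gn = torch.nn.GroupNorm(32, 256)
x = torch.randn(1, 256, 1, 1)  # e.g. the output of global pooling

# Previously rejected; now accepted because each group still
# normalizes over 256 / 32 = 8 elements:
y = gn(x)
print(y.shape)  # torch.Size([1, 256, 1, 1])
```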

Test Plan: check that existing tests pass

Reviewed By: fmassa

Differential Revision: D19723407

fbshipit-source-id: c85c244c832e6592e9aedb279d0acc867eef8f0c
2020-02-18 06:43:53 -08:00
dde2ff4608 [Fuser] Add a knob for disabling/enabling CUDA fuser. (#33395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33395

By default the GPU fuser stays enabled, but this function allows it to be
disabled manually. It will be useful when working on other fuser
implementations.
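
A rough sketch of toggling fusion from Python; the entry-point name below comes from the JIT test utilities and is an assumption, not necessarily the function this diff adds:

```python
import torch

torch._C._jit_override_can_fuse_on_gpu(False)  # disable the CUDA fuser
torch._C._jit_override_can_fuse_on_gpu(True)   # restore the default
```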

Test Plan: Imported from OSS

Differential Revision: D19926911

Pulled By: ZolotukhinM

fbshipit-source-id: 7ea9d1dd7821453d640f81c487b63e1d585123c4
2020-02-17 21:28:09 -08:00
a203dc2e6d [C++ API] Allow skipping default arguments in module's forward method when module is used in Sequential (#33027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33027

This PR allows default arguments in a module's forward method to be skipped when the module is used in `torch::nn::Sequential`, by introducing the `FORWARD_HAS_DEFAULT_ARGS` macro and requiring that every module with default arguments in its forward method provide a corresponding `FORWARD_HAS_DEFAULT_ARGS` macro call.

Fixes issue mentioned in https://github.com/pytorch/pytorch/issues/30931#issuecomment-564144468.

Test Plan: Imported from OSS

Differential Revision: D19777815

Pulled By: yf225

fbshipit-source-id: 73282fcf63377530063e0092a9d84b6c139d2e32
2020-02-17 20:38:02 -08:00
4724964810 [C++ API] Expose AnyValue and AnyModuleHolder classes (#33026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33026

This PR contains necessary changes to prepare for https://github.com/pytorch/pytorch/pull/33027. It exposes the following classes to public:
1. `torch::nn::AnyValue`, because if the user has optional arguments in their module's forward method, they must also use the `FORWARD_HAS_DEFAULT_ARGS` macro and pass in the default values for those optional arguments wrapped by `torch::nn::AnyValue`.
2. `torch::nn::AnyModuleHolder`, because `torch::nn::Module` needs to declare it as a friend class for it to be able to access `torch::nn::Module`'s protected methods such as `_forward_has_default_args` / `_forward_num_required_args` / `_forward_populate_default_args`.

Test Plan: Imported from OSS

Differential Revision: D19777814

Pulled By: yf225

fbshipit-source-id: 1c9d5aa24f0689154752c426a83ee98f64c9d02f
2020-02-17 20:35:22 -08:00
5d7f42847c Add at::Tensor::retain_grad API (#33349)
Summary:
This PR adds `at::Tensor::retain_grad`, and its implementation mirrors the Python `torch.Tensor.retain_grad` API:
c6271c63f2/torch/tensor.py (L292-L315)
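For reference, a small sketch of the Python behavior the new C++ method mirrors:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
h = x * 2          # non-leaf; its .grad is normally discarded
h.retain_grad()    # ask autograd to keep it
h.sum().backward()
print(h.grad)      # populated thanks to retain_grad()
```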
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33349

Differential Revision: D19944524

Pulled By: yf225

fbshipit-source-id: e61d5d761996b6d1b860c04c4b4650c1a49a6a8c
2020-02-17 20:03:48 -08:00
55fa133cdc Remove gpu_kernel_with_index (#33370)
Summary:
Although `gpu_kernel_with_index` might look like a fairly general helper function at first glance, it actually isn't.

The problem is not only 32-bit indexing, but something more fundamental: `TensorIterator` reorders dims and shapes, so if you have a non-contiguous tensor such as `torch.empty(5, 5).t()`, the index won't be correct. Since the whole point of `TensorIterator` is to manipulate shapes/strides to speed up loops, it is fundamentally impossible to get the correct linear index without tons of effort.

The only reason the range factories are not currently failing on `out=non_contiguous_tensor` is luck: `has_internal_overlap` naively classifies everything non-contiguous as `TOO_HARD`.

Since `gpu_kernel_with_index` is not general, we should move it from `Loops.cuh` to `RangeFactories.cu`. And since the kernel is so simple to implement, it makes no sense to use `TensorIterator`, which goes through tons of unnecessary checks like `compute_dtypes`.

`torch.range` is not tested for 64-bit indexing, and I will file a new PR to remove it (it was supposed to be removed in 0.5).
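A sketch of the non-contiguous `out` failure mode described above (assumes a CUDA device):

```python
import torch

out = torch.empty(5, 5, device='cuda').t()  # non-contiguous view
# Slips past has_internal_overlap (classified as TOO_HARD), so a kernel
# relying on TensorIterator's reordered linear index could fill the
# values in a transposed, incorrect layout before this change.
torch.arange(25, out=out)
```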

Benchmark:
The device is a GTX 1650; I don't have a good GPU at home.

Code:
```python
import torch
print(torch.__version__)

for i in range(100):
    torch.randn(1000, device='cuda')
torch.cuda.synchronize()

for i in range(15, 29):
    %timeit torch.arange(2 ** i, device='cuda'); torch.cuda.synchronize()
```

Before:
```
1.5.0a0+c37a9b8
11.9 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.9 µs ± 923 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.4 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.7 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
162 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
312 µs ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
618 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.22 ms ± 9.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.45 ms ± 97.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After:
```
1.5.0a0+7960d19
11 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.4 µs ± 550 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.4 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
27.6 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.2 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
83.3 µs ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
603 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.2 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.4 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.77 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.51 ms ± 933 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33370

Differential Revision: D19925990

Pulled By: ngimel

fbshipit-source-id: f4a732fe14a5582b35a56618941120d62e82fdce
2020-02-17 17:15:04 -08:00
ebb008eb68 Optimize Unfold3dAcc to improve performance of conv3d backward (#33317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33317

Optimize Unfold3dAcc to improve performance of conv3d backward

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "Conv3d"

Reviewed By: houseroad

Differential Revision: D19892678

fbshipit-source-id: 18873dd1d1409263d9925840db302b21fb3b490d
2020-02-17 14:49:02 -08:00
c90b393c00 Fix logging for aborted communicators in ProcessGroupNCCL. (#33147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33147

The log mentioned that it is aborting communicators even if
`blockingWait_` was false. This was incorrect, and I updated the logging to
reflect the appropriate behavior.
ghstack-source-id: 98025017

Test Plan: waitforbuildbot

Differential Revision: D19817967

fbshipit-source-id: fb3415af2cc99eb20981ceaa5203c0a1880fd6f3
2020-02-17 14:42:51 -08:00
1a589f50bd [auto quant] Add quant_scheme_generator to interface with dper
Summary:
Add quant_scheme_generator that will be used to interface with dper.

Also updated two related functions:

- Add batch_size option to save_local_dataset() in dataset utils to be more flexible.

Test Plan:
Tested in the stacked diff D19747206.

buck test deeplearning/numeric_suite/toolkit/test:int8_static_utils_test

Reviewed By: csummersea

Differential Revision: D19745159

fbshipit-source-id: a4ac1ef0ffdddc68bdf5e209ae801b8c475d0b96
2020-02-17 10:41:22 -08:00
87dc2dbcce Updating submodules
Summary:
GitHub commits:

19c040cb01

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: ddc41000622a682874ab3a11fdf4a91038f9c15f
2020-02-16 23:57:14 -08:00
c57f8984e6 [caffe2] make order between div and mul in adagrad consistent (#32974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32974

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/286

Re-attempt of D18805426. Decided to be consistent with PyTorch Adagrad.

There was an inconsistency in the order of operations between the scalar and SIMD code when we compute Adagrad. This diff makes them consistent by doing `w += lr * grad / (sqrt(moment) + epsilon)` in Adagrad and `w += lr / (sqrt(moment) + epsilon) * grad` in RowWiseSparseAdagrad.

The Adagrad order is consistent with PyTorch (see the addcmul_cpu_kernel function in aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp). The RowWiseSparseAdagrad order makes the computation more efficient: `lr / (sqrt(moment) + epsilon)` is shared among all elements in the row.

Also, we're not going to use FMA, to stay consistent with PyTorch (even though it provides a small accuracy benefit).
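A plain-Python sketch of the two (algebraically equivalent) update orders; the function names are illustrative, not the Caffe2 kernel code:

```python
import numpy as np

def adagrad_update(w, grad, moment, lr, eps):
    # per-element moment; divide grad first, then scale by lr
    moment = moment + grad * grad
    w = w + lr * grad / (np.sqrt(moment) + eps)
    return w, moment

def rowwise_sparse_adagrad_update(w_row, grad_row, moment, lr, eps):
    # one moment per row, so the lr / (sqrt(moment) + eps) factor is
    # computed once and shared across all elements of the row
    moment = moment + np.mean(grad_row * grad_row)
    scale = lr / (np.sqrt(moment) + eps)
    return w_row + scale * grad_row, moment
```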

Test Plan: CI

Reviewed By: wx1988

Differential Revision: D19342865

fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
2020-02-16 22:45:59 -08:00
d29997373e Updating submodules
Summary:
GitHub commits:

80dda47903
797af57bb6
b2fceb9d05

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: dde5fb9abca185422df11dc61c658dc333ad63ca
2020-02-16 21:01:37 -08:00
d4e4beddc4 Revert D19871946: [distributed] pass in timeout to TCP store when initializing
Test Plan: revert-hammer

Differential Revision:
D19871946

Original commit changeset: dd002180c4c8

fbshipit-source-id: 40b0676c51e43366c0700e81d16cc7927ee8efc2
2020-02-16 19:37:44 -08:00
df47a3abe0 [distributed] pass in timeout to TCP store when initializing (#33325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325

Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
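For illustration, a sketch of where the timeout enters (standard `init_process_group` usage):

```python
import datetime
import torch.distributed as dist

# With this change, the 30s timeout also bounds the initial TCPStore
# connection, rather than only applying after rendezvous completes.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:23456",
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(seconds=30),
)
```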
ghstack-source-id: 98401875

Test Plan: Added a UT

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
2020-02-16 17:59:44 -08:00
c75d06d854 Move gating part of SparseFeatureGating to local
Summary: In dper2, the local net is hard-coded by whitelisting some layers. Add the SparseFeatureGating-related layers to the local net explicitly.

Test Plan:
* workflow: f167812211
* QRT: fall back looks normal

Differential Revision: D19852280

fbshipit-source-id: 6fecc3d745c3f742d029575a7b9fe320618f1863
2020-02-16 14:18:27 -08:00
f6808df75f [BC] Temporarily fix the BC check (#33387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33387

CI is broken. Skip two functions to fix the problem.

Test Plan: ci

Reviewed By: hl475

Differential Revision: D19926249

fbshipit-source-id: a46d1465c59de8616d2af5fb0b9cc18532359f88
2020-02-15 18:31:25 -08:00
495bd5818b Fix index truncation in argmin/max for large tensors (#33310)
Summary:
Fixes the `TensorIterator` parts of https://github.com/pytorch/pytorch/issues/32863 (THC is still broken)

`TensorIterator::split` now keeps track of the `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops.
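A sketch of the truncation this fixes (needs a CUDA tensor with more than 2^31 elements, roughly 2 GiB at uint8):

```python
import torch

x = torch.zeros(2**31 + 1, dtype=torch.uint8, device='cuda')
x[-1] = 1
# Before the fix, the partial result from a 32-bit sub-iterator could be
# returned without translating it into the full tensor's index range.
print(x.argmax())  # expected: tensor(2147483648, device='cuda:0')
```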
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33310

Differential Revision: D19906136

Pulled By: ngimel

fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b
2020-02-15 17:24:55 -08:00
cd038c0ae9 Get rid of some template arguments in GPU loop (#33308)
Summary:
Globally define
```C++
constexpr int num_threads = C10_WARP_SIZE * 2;
constexpr int thread_work_size = 4;
constexpr int block_work_size = thread_work_size * num_threads;
```
and kill all the template arguments passing these values.

These are effectively global, but we are now passing them around as template arguments, causing a lot of inconvenience in coding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33308

Differential Revision: D19907250

Pulled By: ngimel

fbshipit-source-id: 4623b69baea7e6e77f460ffdfa07cf9f8cba588a
2020-02-15 15:17:46 -08:00
fd684cc312 Use torch.set_default_dtype in test_data_parallel and rename dtype2prec (#32962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32962

As per gchanan's comments on
https://github.com/pytorch/pytorch/pull/30445, I've used
`torch.set_default_dtype` in test_data_parallel instead of specifying
dtype=torch.double everywhere. Also, renamed dtype2prec to dtype2prec_DONTUSE
ghstack-source-id: 98388429

Test Plan: waitforbuildbot

Differential Revision: D19714374

fbshipit-source-id: eb55bbca33881625636ba9ea6dd4cb692f25668e
2020-02-15 14:07:54 -08:00
6dd6b0bfae Revert D19900566: [pytorch][PR] Simplify prim::shape when we have complete tensor types.
Test Plan: revert-hammer

Differential Revision:
D19900566

Original commit changeset: c8eaad70c8ea

fbshipit-source-id: 764f2139fdf19f22a397694d011078ec525f5e8a
2020-02-15 11:37:35 -08:00
d35a4c202e Add support for aten::slice to guard elimination. (#33311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33311

Differential Revision: D19911105

Pulled By: resistor

fbshipit-source-id: 402cfe5f2e03a62b78ed13157e1462cefd9eeafb
2020-02-14 22:54:37 -08:00
c37a9b874b Updating submodules
Summary:
GitHub commits:

65758fd3b1
fb73204584
618f71a795

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 814ebbcf35bcecc62ec64854a26ea645d651fbc2
2020-02-14 20:48:09 -08:00
1e76649d30 fast setup for output tensor in tensor iterator (#33165)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33165

Test Plan: Imported from OSS

Differential Revision: D19825853

Pulled By: glaringlee

fbshipit-source-id: 8f908f2e93a4e377306a77e8a771208603b20e72
2020-02-14 20:34:50 -08:00
c6271c63f2 Updating submodules
Summary:
GitHub commits:

46fd5fed10
87cd6087c6

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 402427af823fe31ac1f6e18c5a020ec6ec7cc1af
2020-02-14 20:04:48 -08:00
e1a895858f Allow to register custom passes both before and after fusion. (#33261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33261

It was requested in #33114.

Test Plan: Imported from OSS

Differential Revision: D19910600

Pulled By: ZolotukhinM

fbshipit-source-id: 827f1744b97f386065a21d1ba5d82c1f90edbe46
2020-02-14 16:28:52 -08:00
3359871f5d .circleci: Use volume mounts instead of docker cp (#33355)
Summary:
docker cp was erroring out, so let's just use volume mounts instead, which
should hopefully be more consistent.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33355

Differential Revision: D19913948

Pulled By: seemethere

fbshipit-source-id: 059ddd36a8162f946cfea451b5dcd1706f1209e9
2020-02-14 15:32:57 -08:00
dfafe2aad1 .circleci: Swap PYTORCH_BUILD_VERSION if on tag (#33326)
Summary:
Basically just fills out PYTORCH_BUILD_VERSION with the correct version
based on the git tag.

This makes it so that we don't have to continually edit this file
when doing releases.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33326

Differential Revision: D19911035

Pulled By: seemethere

fbshipit-source-id: e27105f3e193a49dd68452d8f60232f8a132acad
2020-02-14 14:43:29 -08:00
5cab54e0db Revert D19560159: [RPC Reliability] Implemented retries for RPCs with exponential backoff
Test Plan: revert-hammer

Differential Revision:
D19560159

Original commit changeset: 40cd86f9a25d

fbshipit-source-id: 70f5b19bc05fc34e3c912f42f9d32b9fb80aed06
2020-02-14 14:29:59 -08:00
0b5b2b864a [BC-Breaking] Rename at::Tensor::base() to _base() (#33316)
Summary:
This PR renames `at::Tensor::base()` to `at::Tensor::_base()`, to achieve parity with Python `torch.Tensor._base` API.

----

This PR is BC-breaking in the following way:

Previously, to get the tensor that this tensor is a view of, the user would call `tensor.base()` in C++. Now, they must call `tensor._base()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33316

Differential Revision: D19905687

Pulled By: yf225

fbshipit-source-id: 949d97b707b2c82becb99ac89e9ac24359d183e6
2020-02-14 14:06:58 -08:00
9c0625b004 [iOS] Add watchOS support (#33318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33318

### Summary

Recently, there was a [discussion](https://discuss.pytorch.org/t/libtorch-on-watchos/69073/14) in the forum about watchOS. This PR adds support for building watchOS libraries.

### Test Plan

- `BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=WATCHOS ./scripts/build_ios.sh`

Test Plan: Imported from OSS

Differential Revision: D19896534

Pulled By: xta0

fbshipit-source-id: 7b9286475e895d9fefd998246e7090ac92c4c9b6
2020-02-14 14:02:22 -08:00
ecd9a5ad12 Simplify prim::shape when we have complete tensor types. (#33336)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33336

Differential Revision: D19900566

Pulled By: resistor

fbshipit-source-id: c8eaad70c8ea57ebbe920dcfbdaf6a9435b49506
2020-02-14 13:53:08 -08:00
9c8b67b179 Revert D19905015: Revert D19858239: [pytorch][PR] Refactor and add VS 14.16 and 2019 CI for Windows
Test Plan: revert-hammer

Differential Revision:
D19905015

Original commit changeset: b117e44d5552

fbshipit-source-id: a10c78aed953434f69f466bdd36f914334ba82f3
2020-02-14 13:42:29 -08:00
b730c5a3bd remove dispatch key (#33266)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33266

Test Plan: Imported from OSS

Differential Revision: D19907697

Pulled By: anjali411

fbshipit-source-id: 99fc06b7c41229e8d9ed4271de62247cda12ee6e
2020-02-14 13:26:15 -08:00
6ade7e3a15 [ROCm] Enable 3D convolutions through ROCm (#33067)
Summary:
For both the Caffe2 and PyTorch backends, enable 3D convolutions through MIOpen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33067

Reviewed By: BIT-silence

Differential Revision: D19880495

Pulled By: bddppq

fbshipit-source-id: 8f6f970910654c1c5aa871b48a04c1054875691c
2020-02-14 13:19:10 -08:00
9823662b43 [ONNX] Export split with list of sizes (#33161)
Summary:
Exporting Split with a dynamic list of split_sizes was previously not supported.
This PR enables export using the ONNX SplitToSequence + SequenceAt ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33161

Reviewed By: hl475

Differential Revision: D19860152

Pulled By: houseroad

fbshipit-source-id: 300afedc22b01923efb23acd1a3627aa146bb251
2020-02-14 12:46:33 -08:00
e9e9331927 Fractional Max Pooling: output ratios defined as double (#33304)
Summary:
References https://github.com/pytorch/pytorch/issues/33240.
Changes `options.output_ratio` from a long integer to a double so that fractional ratios can be used to calculate the output size from the input size.
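A C++-frontend sketch, under the assumption that `FractionalMaxPool2dOptions` accepts the ratio directly:

```cpp
#include <torch/torch.h>

int main() {
  namespace nn = torch::nn;
  // A fractional ratio like 0.5 is now representable (it was silently
  // truncated while output_ratio was stored as a long integer).
  auto pool = nn::FractionalMaxPool2d(
      nn::FractionalMaxPool2dOptions(3).output_ratio(0.5));
  auto y = pool(torch::rand({1, 1, 16, 16}));
  // y's spatial size is roughly half the input's
}
```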
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33304

Differential Revision: D19887318

Pulled By: yf225

fbshipit-source-id: 228c2c6bf4158307700c2a983d27d539c6b9eded
2020-02-14 12:31:39 -08:00
243cc20451 Enable inplace relu fusion for training (#33105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33105

Support inplace relu for Conv+BN+Relu fusion during training.
ghstack-source-id: 97944659

Test Plan: buck test caffe2/test:quantization --  'test_fuse_module_train \(test_quantization\.FusionTest\)' --print-passing-details

Differential Revision: D19795221

fbshipit-source-id: 056dc06050d145750c4d0044c0fc1c3febcfdafc
2020-02-14 12:15:58 -08:00
8245641091 Re-activate binary_macos_libtorch_2_7_cpu_build and binary_macos_li… (#33321)
Summary:
Re-send the PR as Intel has restored the relevant packages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33321

Differential Revision: D19894221

Pulled By: zhangguanheng66

fbshipit-source-id: bc19dcfa5b17ff047f9ae09ebd8eadfb01f7ed68
2020-02-14 12:01:56 -08:00
92b67c03e4 [RPC Reliability] Implemented retries for RPCs with exponential backoff (#32602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32602

This adds functionality for retrying RPCs that are sent with the function `sendWithRetries()`. It adds RPCs that will potentially need to be retried to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retryable RPC from the map, sleeps until the corresponding time point, retries the RPC, and adds it to the map again with a future timeout.

GitHub Issue: https://github.com/pytorch/pytorch/issues/32124

Per the first 3 milestones, the following will be addressed in future PRs:
* enabling RPC Retries for RRef internal messages

Differential Revision: D19560159

fbshipit-source-id: 40cd86f9a25dc24367624d279a3b9720b20824cf
2020-02-14 11:57:24 -08:00
ae53f8dd25 Revert D19859905: [pytorch][PR] Gradient scaling API
Test Plan: revert-hammer

Differential Revision:
D19859905

Original commit changeset: bb8ae6966214

fbshipit-source-id: 28f1c93e8a00e3a4bbe8cc981499b15468f0b970
2020-02-14 11:03:27 -08:00
b276ddda38 remove THC dist code which is never used (#33283)
Summary:
Remove THC dist code which is never used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33283

Differential Revision: D19905361

Pulled By: gchanan

fbshipit-source-id: 367fd31e2209d36b30af31511554fdbdd67c98e4
2020-02-14 10:37:23 -08:00
4bef344210 Implementation of mixture distributions (#22742)
Summary:
Addressing issue https://github.com/pytorch/pytorch/issues/18125
This implements a mixture distribution, where all components are from the same distribution family. Right now the implementation supports the `mean`, `variance`, `sample`, and `log_prob` methods.
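A quick usage sketch (a three-component Gaussian mixture):

```python
import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal

mix = Categorical(torch.tensor([0.2, 0.5, 0.3]))
comp = Normal(torch.tensor([-1.0, 0.0, 2.0]), torch.tensor([0.5, 1.0, 0.3]))
gmm = MixtureSameFamily(mix, comp)

samples = gmm.sample((5,))
log_p = gmm.log_prob(samples)
print(gmm.mean, gmm.variance)
```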

cc: fritzo and neerajprad

- [x] add import and `__all__` string in `torch/distributions/__init__.py`
- [x] register docs in docs/source/distributions.rst

### Tests
(all tests live in tests/distributions.py)
- [x] add an `Example(MixtureSameFamily, [...])` to the `EXAMPLES` list,
     populating `[...]` with three examples:
     one with `Normal`, one with `Categorical`, and one with `MultivariateNormal`
     (to exercise, `FloatTensor`, `LongTensor`, and nontrivial `event_dim`)
- [x] add a `test_mixture_same_family_shape()` to `TestDistributions`. It would be good to test this with both `Normal` and `MultivariateNormal`
- [x] add a `test_mixture_same_family_log_prob()` to `TestDistributions`.
- [x] add a `test_mixture_same_family_sample()` to `TestDistributions`.
- [x] add a `test_mixture_same_family_shape()` to `TestDistributionShapes`

### Triaged for follow-up PR?
- support batch shape
- implement `.expand()`
- implement `kl_divergence()` in torch/distributions/kl.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22742

Differential Revision: D19899726

Pulled By: ezyang

fbshipit-source-id: 9c816e83a2ef104fe3ea3117c95680b51c7a2fa4
2020-02-14 10:31:56 -08:00
7dde91b0ae Vectorize elu and its backward function on CPU (#32986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32986

Benchmark: (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz)

```python
import timeit
for op in ('ELU',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(10_000, 100000),
                    (100_000, 10000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a)', setup=f'import torch; m = torch.nn.{op}(); a = torch.linspace(-1, 1, {n}, dtype={dtype})', number=t))
    print('Backward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(20_000, 100000),
                    (200_000, 10000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('y.backward(retain_graph=True)',
                                setup=f'import torch; m = torch.nn.{op}(); a = torch.linspace(-1, 1, {n}, requires_grad=True, dtype={dtype}); x = m(a); y = x.sum()',
                                number=t))
```

Before:

```
Forward
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.double
5.292799739996553
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.double
4.828570917001343
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.float
3.1359513780043926
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.float
2.7030876770004397
Backward
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.double
4.568238995998399
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.double
1.8908141480060294
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.float
3.8652471189998323
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.float
1.13068484600808
```

After:

```
Forward
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.double
2.1265591429983033
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.double
1.6708065870043356
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.float
1.1806934149935842
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.float
0.77735430400935
Backward
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.double
4.494567882007686
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.double
2.007220732004498
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.float
3.615133151994087
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.float
1.105554559995653
```

Test Plan: Imported from OSS

Differential Revision: D19794595

Pulled By: VitalyFedyunin

fbshipit-source-id: c319ec04676ced22179b8b34789ac8bf6428deab
2020-02-14 09:45:17 -08:00
1b2d2ba504 [PyTorch] Fix write-after-free (TSAN) in GraphTask::set_error() (#33156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33156

When dist_autograd_spawn_thrift's 'test_backward_node_failure_python_udf' test is
run, it encounters a TSAN error related to holding the mutex while the
underlying data structure is being deallocated.

In this change, we simply get a shared_ptr<> reference to the future, and
call set_exception() without holding the lock, to avoid deallocating
underneath the lock.
ghstack-source-id: 98303434

Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/rpc:dist_autograd_spawn_thrift -- 'test_backward_node_failure_python_udf \(test_dist_autograd_spawn\.DistAutogradTestWithSpawn\)'

Differential Revision: D19821362

fbshipit-source-id: 82f735e33f8e608552418ae71592400fa3621e40
2020-02-14 09:32:17 -08:00
0c98939b7b Revert D19899550: [pytorch][PR] Second try on Von Mises: Make it JIT compatible
Test Plan: revert-hammer

Differential Revision:
D19899550

Original commit changeset: fbcdd9bc9143

fbshipit-source-id: c8a675a8b53f884acd0e6c57bc7aa15faf83d5d6
2020-02-14 08:42:16 -08:00
ff5f38f53b Revert D19858239: [pytorch][PR] Refactor and add VS 14.16 and 2019 CI for Windows
Test Plan: revert-hammer

Differential Revision:
D19858239

Original commit changeset: f068d8505886

fbshipit-source-id: b117e44d5552e157747920d8098ce3b86a29c6bf
2020-02-14 07:35:08 -08:00
b1583ceb1e Second try on Von Mises: Make it JIT compatible (#33177)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/17168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33177

Differential Revision: D19899550

Pulled By: ezyang

fbshipit-source-id: fbcdd9bc91438164bcb2b1cbc314c765520754e1
2020-02-14 07:16:41 -08:00
ecd3c252b4 Suport all length one SLS op lowering: C2 part (#33332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33332

We check the input shape of lengths and indices of SLS and add an attribute if they are the same.

Test Plan:
```
buck test glow/fb/test/numerics:test_operator_onnxifinnpi -- test_slws_fused_8bit_rowwise_length1_graph
```

Reviewed By: ipiszy

Differential Revision: D19874903

fbshipit-source-id: 06b643b5351d0ba19ba209b5a5b599fbb38b1dfc
2020-02-13 22:53:11 -08:00
0150f40dde dont force msvc /Ox flag which can conflict with /RTC1 in debug config (#33164)
Summary:
Relates to https://github.com/pytorch/pytorch/issues/33132

This fix doesn't add the full multi-configuration support described in https://github.com/pytorch/pytorch/issues/33132, but it at least avoids the error presented in the issue when `CMAKE_BUILD_TYPE=Debug` is used with MSVC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33164

Differential Revision: D19899727

Pulled By: ezyang

fbshipit-source-id: 28a364d920c4a3fb577c6b484ccd69a133fbcf5d
2020-02-13 22:15:20 -08:00
602aec325d Kill old cuda support (#33302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33302

Differential Revision: D19899586

Pulled By: ezyang

fbshipit-source-id: 11293475795b4bfee9a65133bb6718649e220787
2020-02-13 21:48:07 -08:00
e5218e3e12 Add missing error messages for container modules (#29991)
Summary:
Container `Module`s, including `ModuleList`, `ParameterList` and `ParameterDict`, should not be called like a regular `Module`.
This PR adds error messages for these special modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29991

Differential Revision: D19698535

Pulled By: ezyang

fbshipit-source-id: fe156a0bbb033041086734b38f8c6fde034829bf
2020-02-13 21:34:27 -08:00
92fbf7cf97 [caffe2] use JIT'ed fp16 SLS (#32432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32432

Use JIT'ed fp16 SLS in D19477209 from Caffe2 operators

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D19477208

fbshipit-source-id: ef2ccba10f5f4c475166141bf09c266dedb92d38
2020-02-13 21:15:39 -08:00
642bd51043 [ONNX] Skip problematic ONNX test to unblock CI (#33323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33323

skip the tests until they are fixed

Test Plan: ci

Reviewed By: hl475

Differential Revision: D19894675

fbshipit-source-id: 1cfc153577bf021171f4412115d84719beae7a91
2020-02-13 21:08:27 -08:00
e5c7b7b8b5 Automatic update of fbcode/onnx to 04a29addfd5b912812addb8dea5f8763fbfaad01 (#33328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33328

Previous import was 8b3f7e2e7a0f2aba0e629e23d89f07c7fc0e6a5e

Included changes:
- **[04a29add](https://github.com/onnx/onnx/commit/04a29add)**: Use // instead of # (#2598) <Lu Fang>
- **[f8e140a9](https://github.com/onnx/onnx/commit/f8e140a9)**: Kezhan/function update (#2596) <Ke Zhang>
- **[6185faae](https://github.com/onnx/onnx/commit/6185faae)**: fix the attribute types section in IR.md (#2590) <Ke Zhang>
- **[f254647a](https://github.com/onnx/onnx/commit/f254647a)**: Allow Constant operator to promote scalar and list to tensors. (#2592) <Jeremy Cochoy>
- **[f12ec799](https://github.com/onnx/onnx/commit/f12ec799)**: Add NegativeLogLikelihood(NllLoss) op (#2551) <liqunfu>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D19897554

fbshipit-source-id: d8efb5c5ac8f9d71727de33c67af681ed8ec8123
2020-02-13 21:03:17 -08:00
93179b1c1c [jit] Initial use RRef in TorchScript (#33190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33190

This enables the initial RRef type to be used inside TorchScript: a user
can pass a Python RRef into a TorchScript function and call to_here
inside. Specifically, this PR:

- Add RRef schema type parsing
- Add python interop for RRef in Python and into JIT
- register to_here op in register_distributed_ops

More support for RRef in TorchScript will be added in future PRs

Test Plan: Imported from OSS

Differential Revision: D19871244

Pulled By: wanchaol

fbshipit-source-id: 7eca6c491a84666b261c70806254b705603bd663
2020-02-13 20:17:25 -08:00
b2c5896432 [jit] Add RRef to IValue and JIT type system (#32992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32992

This PR add RRef to IValue and the JIT type system.

- The RRefInterface abstract class inherits from intrusive_ptr_target,
  which makes it possible to hold the RRef class in an IValue as an intrusive_ptr

- Add RRefType as a JIT type; it's a container type similar to the
  future type.

Test Plan: Imported from OSS

Differential Revision: D19871242

Pulled By: wanchaol

fbshipit-source-id: cb80ca32605096f9a42ef147109fb368a7c1d4d3
2020-02-13 20:17:20 -08:00
9ae4d38a21 [rpc] Switch RRef to be managed by intrusive_ptr (#33189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33189

Add RRefInterface to Aten/Core, which will later be used by IValue

Switch the RPC code base to use intrusive_ptr instead of shared_ptr,
so that we can add it to IValue.

Actually adding it to IValue and the JIT will happen in the next PR.

Test Plan: Imported from OSS

Differential Revision: D19871241

Pulled By: wanchaol

fbshipit-source-id: d7e1fd04b46320e0f26c18591b49c92ad30a4032
2020-02-13 20:15:31 -08:00
cb4e6d025a Updates numpy to tensor negative stride error message (#33254)
Summary:
See https://discuss.pytorch.org/t/bugs-about-torch-from-numpy-array/43312.

This update incorporates albanD's suggestion into the error message, saving future users from having to ask or look on the forums if they encounter this issue and don't mind making their arrays contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33254

Differential Revision: D19885808

Pulled By: mruberry

fbshipit-source-id: 8f0fd994cf8c088bf3c3940ab4dfb3ddbc5b3ede
2020-02-13 15:38:52 -08:00
a80d0330e4 add int4 fake fp16 mappings
Summary: update this mapping with the int4 SLS ops so we can run net_runner

Test Plan: testing with net_runner

Reviewed By: jfix71

Differential Revision: D19879826

fbshipit-source-id: eac84b10e2365c21cb8a7cfbf3123e26a9945deb
2020-02-13 15:37:23 -08:00
eb9b4b1f29 handle errors in ProcessGroupAgent::listenLoop(). (#32957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32957

Closes https://github.com/pytorch/pytorch/issues/29703. If there is a
gloo timeout and `recvWork->wait()` times out in `listenLoop()`,
ProcessGroupAgent crashes since there is an unhandled exception in a thread.
This catches the exception and exits the listen loop. In a follow-up diff, we
will enhance these error conditions so that if users attempt to send RPCs
again, they are notified that the RPC agent was in a bad state and has been
shut down.

This PR also adds a new option, `processGroupTimeout`, to the PG agent's backend
options. This allows us to control the gloo timeout.
ghstack-source-id: 98236783

Test Plan: Added a unit test.

Differential Revision: D19678979

fbshipit-source-id: 3895ae754f407b84aca76c6ed3cb087d19178c40
2020-02-13 14:50:05 -08:00
7ae1e023e7 glu: port cpu forward implementation to ATen (#26410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26410

I only ported the CPU forward implementation for now to try a CPU-only benchmark.

Test Plan: Imported from OSS

Differential Revision: D17454519

Pulled By: gchanan

fbshipit-source-id: ff757cf972c5627074fea2f92a670129007a49f4
2020-02-13 14:32:25 -08:00
0808485c6a Workaround performance bug / memory leak in GOMP (#32875)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32008

This is similar to CaoZhongZ's patch which runs on all OpenMP threads in the team and selectively exits early to scale the number of threads active. I have also restored the `if` clause from before https://github.com/pytorch/pytorch/issues/26963 so that running on 1 thread should still avoid additional synchronisation.

One comment is that this does slightly change the meaning of `at::get_num_threads` inside of a `parallel_for` loop since it's not guaranteed that the function was called on that many threads. I've looked at the uses within ATen and couldn't see anything that would be problematic. There are a few places in `quantized` that seem to make this assumption but they always use a grain size of 1 so should be safe:
d9e99ab544/aten/src/ATen/native/quantized/cpu/qconv.cpp (L436-L437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32875

Differential Revision: D19775823

Pulled By: VitalyFedyunin

fbshipit-source-id: 4f843b78cdb9e2766339590d728923786a00af6d
2020-02-13 14:31:08 -08:00
bbdc5b7bd0 Optimize error checking in mvlgamma (#32665)
Summary:
- Clean up error-checking code
- Avoid unnecessary floating-point computation
- Use float instead of double when possible to avoid a massive cast in the tensor
- Use bool instead of uint8_t for clear Boolean purpose
- Improve the error message
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32665

Differential Revision: D19601920

Pulled By: VitalyFedyunin

fbshipit-source-id: 0c6c6b5ff227b1437a6c1bae79b2c4135a13cd37
2020-02-13 14:05:19 -08:00
5b922918d0 Disable flaky test TestCppExtensionAOT.test_cuda_extension in Windows CI (#33282)
Summary:
See https://github.com/pytorch/pytorch/issues/33270 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33282

Differential Revision: D19886975

Pulled By: yf225

fbshipit-source-id: 7e6756095b1bb8c55fc5acb8fc2cb02c1e89b032
2020-02-13 13:10:44 -08:00
0c93c2b142 Add a warning sign for anomaly detection (#33176) (#33239)
Summary:
Fixes [33176](https://github.com/pytorch/pytorch/issues/33176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33239

Differential Revision: D19879847

Pulled By: albanD

fbshipit-source-id: 594b936c10f98c364331e782b64f42059413a741
2020-02-13 12:52:21 -08:00
6c6a814a2c Beef up documentation on DispatchKey.h (#33011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33011

I also reordered some of the keys in non-semantic ways to make the
organizational grouping more clear.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19796584

Pulled By: ezyang

fbshipit-source-id: 3083abadb47e9f382b9fbe981af0b34203c6ea4d
2020-02-13 12:26:19 -08:00
2e88d3d703 [quant] Add Quantized BatchNorm2d module (#33109)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33109

Test Plan:
python test/test_quantized_nn_mods.py ModuleAPITest.test_batch_norm

Imported from OSS

Differential Revision: D19861926

fbshipit-source-id: 67315e49b4b3577b965d422ca707d927d977feeb
2020-02-13 12:15:43 -08:00
d0435604a5 [quant] Add a quantized batch_norm operator (#33080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33080

Quantized batch norm for cases where batch norm cannot be fused with conv.
AVX2 implementation is from Caffe2.

Test Plan:
python test/test_quantized.py TestQuantizedOps.test_batch_norm

Imported from OSS

Differential Revision: D19861927

fbshipit-source-id: bd8cd101fc063cb6358132ab7c651a160999293c
2020-02-13 12:15:38 -08:00
b28a834813 [codemod][lint][fbcode] Apply google-java-format
Test Plan: Sandcastle. Visual inspection.

Reviewed By: scottrice

Differential Revision: D19878711

fbshipit-source-id: be56f70b35825140676be511903e5274d1808f25
2020-02-13 12:14:14 -08:00
bf16688538 [JIT] peephole optimize values with NoneType (#33264)
Summary:
If a value has the type None, we can always replace it with a None constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33264

Differential Revision: D19878695

Pulled By: eellison

fbshipit-source-id: 5d0e7ffb37c5747997df093fec3183039d8dff4d
2020-02-13 12:03:49 -08:00
0c474d95d9 Remove Half support in binary cross entropy and some activation functions on CPU (#33206)
Summary:
For reasons similar to https://github.com/pytorch/pytorch/issues/33021. Note that support for the half type has
not been available in any release yet, so it should be safe to remove (all forward ones concerning this PR were added in daef363b15c8a3aaaed09892004dc655df76ff81 and 8cb05e72c69fdd837548419770f3f1ba9807c16d).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33206

Differential Revision: D19861137

Pulled By: ezyang

fbshipit-source-id: 38a3a398a716a782c26a611c56ddeab7eb7ac79e
2020-02-13 11:47:42 -08:00
946f3a9ed7 Refactor and add VS 14.16 and 2019 CI for Windows (#33117)
Summary:
Changes according to https://github.com/pytorch/pytorch/issues/18319.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33117

Differential Revision: D19858239

Pulled By: ezyang

fbshipit-source-id: f068d8505886b92c9388c9c636eab5bd20377ceb
2020-02-13 11:45:41 -08:00
2635055229 [ROCm] Enable 3D batch norms through MIOpen (#33262)
Summary:
Enable test for Caffe2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33262

Differential Revision: D19880486

Pulled By: bddppq

fbshipit-source-id: af663a11137a53302e55198f38117ab6bdc9ec89
2020-02-13 11:29:51 -08:00
acea368095 Fix compilation error when buildng with FFMPEG (#27589)
Summary:
When building with FFMPEG, I encountered a compilation error due to a missing include/library.
I also find that the change in video_input_op.h will improve the build on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27589

Differential Revision: D19700351

Pulled By: ezyang

fbshipit-source-id: feff25daa43bd2234d5e75c66b9865b672a8fb51
2020-02-13 11:23:48 -08:00
40246fa63c Gradient scaling API (#26512)
Summary:
This PR implements the gradient scaling API that mruberry, jjsjann123, ngimel, zdevito, gchanan and I have been discussing.  Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081.

Volume-wise, this PR is mostly documentation and tests. The Python API (found entirely in `torch/cuda/amp/amp_scaler.py`) is lightweight. The exposed functions are intended to make the implementation and control flow of gradient scaling convenient, intuitive, and performant.

The API is probably easiest to digest by looking at the documentation and examples. `docs/source/amp.rst` is the homepage for the Automatic Mixed Precision package.  `docs/source/notes/amp_examples.rst` includes several examples demonstrating common but not-immediately-obvious use cases.  Examples are backed by tests in `test_cuda.py` (and thankfully the tests pass :P).

Two small utility kernels have been added in `native/cuda/AmpKernels.cu` to improve performance and avoid host-device synchronizations wherever possible.

Existing optimizers, both in the wild and in Pytorch core, do not need to change to use the scaling API.

However, the API was also designed to establish a contract between user scripts and optimizers such that writers of _new_ custom optimizers have the control points they need to implement fast, optionally sync-free updates. User scripts that obey the scaling API can drop such custom optimizers in and reap performance benefits without having to change anything aside from the optimizer constructor itself. [I know what the contract with custom optimizers should be](35829f24ef/torch/cuda/amp/amp_scaler.py (L179-L184)), but I'm waiting for review on the rest of the API before I go about documenting it (it will be given a dedicated section in `docs/source/notes/amp_examples.rst`).

Currently, the gradient scaling examples do not include the auto-casting API as discussed in https://github.com/pytorch/pytorch/issues/25081.  The gradient scaling API is intended to be orthogonal/modular relative to autocasting.  Without auto-casting the gradient scaling API is fully use-_able_, but not terribly use-_ful_, so it's up to you guys whether you want to wait until auto-casting is ready before merging the scaling API as well.
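A hypothetical training-loop sketch; the scaler lives in `torch/cuda/amp/amp_scaler.py` per the summary above, but the class name (`AmpScaler`) and method names (`scale`/`step`/`update`) should be treated as assumptions, not the final API:

```python
import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.AmpScaler()  # assumed name

for _ in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4, device='cuda')).sum()
    scaler.scale(loss).backward()  # backward runs on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips step on inf/NaN
    scaler.update()                # adjusts the scale for the next iteration
```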

### Todo
- [ ] How do I get c10 registered status for my two custom kernels?  They're very simple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26512

Differential Revision: D19859905

Pulled By: mruberry

fbshipit-source-id: bb8ae6966214718dfee11345db824389e4286923
2020-02-13 11:06:06 -08:00
d613bd0522 [rpc][easy] move unnecessary python call directly to pybind (#33174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33174

Closes https://github.com/pytorch/pytorch/issues/32780. It looks like
this is the only callsite where we do `_get_current_rpc_agent().foo()`, and we
can do this directly in the pybind layer to save some overhead.
ghstack-source-id: 98200664

Test Plan: All UTs should pass.

Differential Revision: D19828786

fbshipit-source-id: 5c34a96b5a970e57e6a1fdf7f6e54c1f6b88f3d8
2020-02-13 09:14:13 -08:00
0bf60e348f Revert D19878241: [pytorch][PR] Restore tests binary_macos_libtorch_2_7_cpu_build and binary_macos_li…
Test Plan: revert-hammer

Differential Revision:
D19878241

Original commit changeset: 07bce43e4667

fbshipit-source-id: 7f76717d73e264f30e8f56fb7bc38c8928dea092
2020-02-13 09:09:11 -08:00
ff7d147732 Restore tests binary_macos_libtorch_2_7_cpu_build and binary_macos_li… (#33291)
Summary:
Fix https://github.com/pytorch/pytorch/issues/33209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33291

Differential Revision: D19878241

Pulled By: zhangguanheng66

fbshipit-source-id: 07bce43e466708dacd37b87ba3419435c6a7cde5
2020-02-13 08:48:16 -08:00
d554b112e3 Add histogram collection and weight prepacking utils (#33125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33125

Provide histogram collection and weight-prepacking interfaces for Dper to auto-quantize the Ads models.

Test Plan:
buck test mode/opt deeplearning/numeric_suite/toolkit/test:int8_static_utils_test

buck test mode/opt deeplearning/numeric_suite/toolkit/test:histogram_utils_test

Reviewed By: amylittleyang

Differential Revision: D19794819

fbshipit-source-id: 6a4f4a6684da0977b7df2feed8a4b961db716da8
2020-02-13 01:40:20 -08:00
b98c7d34ed [TVM] Add clip op to c2_frontend (#33257)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33257

Test Plan: buck test caffe2/caffe2/fb/tvm:test_tvm_transform

Reviewed By: yinghai

Differential Revision: D19866406

fbshipit-source-id: e903e15178af323d0bd1f804e09919023c0a2989
2020-02-12 22:30:43 -08:00
16685d93e9 [TVM] Add ReplaceNaN op (#33256)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33256

Test Plan: buck test caffe2/caffe2/fb/tvm:test_tvm_transform

Reviewed By: yinghai

Differential Revision: D19851553

fbshipit-source-id: dee048c52ade16d9e531256b90e5d3391632cd8e
2020-02-12 22:29:30 -08:00
03e9b9ce18 [PyTorch BC] Remove unnecessary items in whitelist (#33247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33247

remove stale items.

Test Plan: ci

Reviewed By: hl475

Differential Revision: D19861294

fbshipit-source-id: 2b112e5908c19a1ff190e3850085038065d21c53
2020-02-12 21:34:18 -08:00
e45343fa14 TORCH_INTERNAL_ASSERT_DEBUG_ONLY not eating message string (#33251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33251

Somehow this was preventing `c10::Error` exceptions from ever being thrown on Windows when `defined(NDEBUG) == false`. Kinda scary.

Test Plan: sandcastle green, made sure `intrusive_ptr_test.cpp` (givenStackObject_whenReclaimed_thenCrashes) passed inside ovrsource using `mode/win/dev-debug`

Reviewed By: malfet

Differential Revision: D19865667

fbshipit-source-id: c32d5752025c043e57d16c6d14a94b069bed0bc3
2020-02-12 21:23:34 -08:00
f61b45fc89 [jit] Support properties on Device (#32953)
Summary:
Stacked PRs
 * #32955 - [jit] Fix flipped PackedSequence outputs in script
 * **#32953 - [jit] Support properties on `Device`**

PyTorch devices have an `index` and a `type` property. This PR adds support for both to TorchScript.
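A small sketch using the newly scriptable properties:

```python
import torch

@torch.jit.script
def device_type(x: torch.Tensor) -> str:
    return x.device.type  # e.g. "cpu" or "cuda"

@torch.jit.script
def device_index(x: torch.Tensor) -> int:
    idx = x.device.index  # Optional[int]: None for un-indexed devices
    if idx is None:
        return -1
    return idx
```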
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32953

Pulled By: driazati

Differential Revision: D19849320

fbshipit-source-id: ce845258c6110058dd9ea1f759ef74b7ed2e786e
2020-02-12 18:59:10 -08:00
806e7daa1f Rename TorchScript compiler to IR emitter to better reflect its function. (#33127)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33127

Test Plan: Imported from OSS

Differential Revision: D19806503

Pulled By: ZolotukhinM

fbshipit-source-id: ab78bdbbac5f12dbcc6c2e2573f5862a16ffcf3d
2020-02-12 18:45:13 -08:00
91744907d4 SGD: updated step and class design (#32592)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32592

Differential Revision: D19868154

Pulled By: anjali411

fbshipit-source-id: ce888efc68b1531d97e8b0abf2b146198e012d2f
2020-02-12 18:38:55 -08:00
914610d079 [pytorch][quant] Add assert for min, max, qmin, qmax for ChooseQuantizationParams (#32739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32739

As Title says.
ghstack-source-id: 98061467

Test Plan: CI

Differential Revision: D19610810

fbshipit-source-id: f9621cd7d780769941ed77974b19c5226d4b2b30
2020-02-12 16:49:31 -08:00
bc0ab07064 Optimize Unfold3d to improve performance of Conv3d (#33191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33191

Optimize Unfold3d to improve performance of the Conv3d forward pass

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "Conv3d"

Reviewed By: houseroad

Differential Revision: D19821946

fbshipit-source-id: 937adafddb9a1aef5f1d1423dd99884c59e465f9
2020-02-12 16:34:55 -08:00
0e753b2818 Fix SIGABORT caused by double exception in PyTorchStreamReader when file not found. (#33243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33243

If a file does not exist in an archive, PyTorchStreamReader throws an exception. However, when PyTorchStreamReader is destructed, another exception is thrown while processing the first exception. As a result of this double exception, the process aborts with SIGABRT.

Thanks dreiss for catching this bug and suggesting the fix. It happened when he used _load_for_mobile to load a TorchScript file without a bytecode session. A unit test is added to cover this case.

Test Plan: Imported from OSS

Differential Revision: D19859205

Pulled By: iseeyuan

fbshipit-source-id: 8f96b6256f1a1f933fce1c256d64604c7e9269e4
2020-02-12 16:27:15 -08:00
ac8511a21e Updating submodules
Summary:
GitHub commits:

927d8afa7a
e64508917b
40d690970f

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 9135af67550f83a598a0a0baa1f9f6b1e4311ddf
2020-02-12 15:43:34 -08:00
f9ad5528e0 Fix for rand_like as well. (#33095)
Summary:
This is a follow-up PR to https://github.com/pytorch/pytorch/issues/32830. It solves the same issue for RandLike that we saw in RandNLike.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33095

Reviewed By: hl475

Differential Revision: D19848625

Pulled By: houseroad

fbshipit-source-id: 147921becf79490027a93606d52c5bc41d9eaf7f
2020-02-12 14:54:39 -08:00
f045dab3dd Remove ImplicitTensorToNum (#32761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32761

This replaces ImplicitTensorToNum with result-specific operators like
IntImplicit, FloatImplicit, or ScalarImplicit. Note that ScalarImplicit
was not correctly implemented before and this PR fixes the lapse.

This does not change on-disk serialization because these operators are not
serialized directly but written as e.g. `annotated(int, foo)`.

Test Plan: Imported from OSS

Differential Revision: D19615385

Pulled By: zdevito

fbshipit-source-id: 48575f408e8219d2ec5b46936fc2aa691f283976
2020-02-12 14:49:07 -08:00
99349defc1 remove unnecessary Node* ops (#32760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32760

Minor changes to the way ops are implemented to remove incidental use of Node*
in the operator implementation.

Current state for operators that previously took Node:

```
TBD:

USES NODE: prim::DifferentiableGraph(...) -> (...)
USES NODE: prim::profile(...) -> (...)
USES NODE: prim::FusionGroup(...) -> (...)
USES NODE: prim::PythonOp(...) -> (...)
USES NODE: prim::ImplicitTensorToNum(Tensor a) -> Scalar # next PR

Should be made interpreter primitives:

USES NODE: prim::TupleUnpack(...) -> (...)
USES NODE: prim::TupleSlice(...) -> (...)
USES NODE: prim::TupleConstruct(...) -> (...)
USES NODE: prim::ListUnpack(...) -> (...)
USES NODE: prim::ListConstruct(...) -> (...)
USES NODE: prim::DictConstruct(...) -> (...)
USES NODE: prim::Constant() -> (...)
USES NODE: prim::isinstance(...) -> (...)
USES NODE: prim::CreateObject(...) -> (...)
USES NODE: prim::fork(...) -> (...)
USES NODE: aten::warn(str message, *, int stacklevel=2) -> () # need stack level information, so ideally in interpreter so it can look at the stack

Should be made into vararg operators, i.e. the operators last argument should be an IValue
that contains the number of arguments.

USES NODE: prim::FusedConcat(...) -> (...)
USES NODE: prim::MMTreeReduce(...) -> (...)
USES NODE: prim::MMBatchSide(...) -> (...)
USES NODE: prim::ConstantChunk(...) -> (...)
USES NODE: prim::AutogradAnyNonZero(...) -> bool
USES NODE: prim::BroadcastSizes(...) -> (...)
USES NODE: prim::ChunkSizes(...) -> (...)
USES NODE: aten::format(str self, ...) -> str
USES NODE: prim::Print(...) -> (...)

fixed:

USES NODE: aten::extend(Tensor[](a!) self, Tensor [] other) -> ()
USES NODE: aten::copy(Tensor[](a) self) -> Tensor[]
USES NODE: aten::extend(int[](a!) self, int [] other) -> ()
USES NODE: aten::copy(int[](a) self) -> int[]
USES NODE: aten::extend(float[](a!) self, float [] other) -> ()
USES NODE: aten::copy(float[](a) self) -> float[]
USES NODE: aten::extend(bool[](a!) self, bool [] other) -> ()
USES NODE: aten::copy(bool[](a) self) -> bool[]
USES NODE: aten::extend(t[](a!) self, t [] other) -> ()
USES NODE: aten::copy(t[](a) self) -> t[]
USES NODE: aten::keys(Dict(str, t) self) -> str[](*)
USES NODE: aten::values(Dict(str, t) self) -> t[](*)
USES NODE: aten::dict((str, tVal)[] inputs) -> Dict(str, tVal)
USES NODE: aten::keys(Dict(int, t) self) -> int[](*)
USES NODE: aten::values(Dict(int, t) self) -> t[](*)
USES NODE: aten::dict((int, tVal)[] inputs) -> Dict(int, tVal)
USES NODE: aten::keys(Dict(float, t) self) -> float[](*)
USES NODE: aten::values(Dict(float, t) self) -> t[](*)
USES NODE: aten::dict((float, tVal)[] inputs) -> Dict(float, tVal)
USES NODE: aten::keys(Dict(Tensor, t) self) -> Tensor[](*)
USES NODE: aten::values(Dict(Tensor, t) self) -> t[](*)
USES NODE: aten::dict((Tensor, tVal)[] inputs) -> Dict(Tensor, tVal)
USES NODE: aten::test_vartype2(t a, t[] b) -> (t[])
USES NODE: aten::_ncf_unsqueeze(Tensor self, int ndim) -> Tensor
USES NODE: aten::_ncf_view(Tensor self, int[] input_shape, int normalized_ndim) -> Tensor
USES NODE: prim::is_none(int? a) -> bool
USES NODE: aten::__interpolate(Tensor input, int? size = None, float[]? scale_factor = None, str mode = 'nearest', bool? align_corners = None, bool? recompute_scale_factor = None) -> Tensor
USES NODE: aten::__interpolate(Tensor input, int[]? size = None, float[]? scale_factor = None, str mode = 'nearest', bool? align_corners = None, bool? recompute_scale_factor = None) -> Tensor
USES NODE: aten::__interpolate(Tensor input, int? size = None, float? scale_factor = None, str mode = 'nearest', bool? align_corners = None, bool? recompute_scale_factor = None) -> Tensor
USES NODE: aten::__interpolate(Tensor input, int[]? size = None, float? scale_factor = None, str mode = 'nearest', bool? align_corners = None, bool? recompute_scale_factor = None) -> Tensor
USES NODE: aten::sorted(t[](a) self) -> (t[])
USES NODE: aten::sort(t[](a!) self, bool reverse=False) -> ()
USES NODE: aten::test_vartype(t[] a, t b) -> (t)
USES NODE: prim::unchecked_unwrap_optional(t(a)? optional) -> t(a)
USES NODE: prim::unchecked_cast(...) -> (...)
USES NODE: aten::dict() -> Dict(str, Tensor)
USES NODE: prim::Load(...) -> (...)
USES NODE: prim::Store(...) -> (...)
USES NODE: prim::Drop(...) -> (...)
USES NODE: aten::tensor(t[] data, *, ScalarType? dtype=None, Device? device=None, bool requires_grad=False) -> Tensor
USES NODE: aten::as_tensor(t[] data, *, ScalarType? dtype=None, Device? device=None) -> Tensor
```

Test Plan: Imported from OSS

Differential Revision: D19615387

Pulled By: zdevito

fbshipit-source-id: 95298c3c4249b9f812c332d13f0fb79daeecb662
2020-02-12 14:49:02 -08:00
72a00a8a9c Remove Node dependencies from operator.h (#32682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32682

This moves code around so that operator.h/cpp no longer requires a full
definition of Node* nor does it include alias analysis or the pretty printer.

This should make it possible to include in the mobile build.

Functionality for checking if operators match Node and to look up
and operator for a Node have moved to the Node object.

Test Plan: Imported from OSS

Differential Revision: D19615386

Pulled By: zdevito

fbshipit-source-id: e38bdf29971183597ef940d061c06ba56e71d9c5
2020-02-12 14:47:26 -08:00
ab14375b08 Workaround for CUDA10.2.89 CUDA extension compilation error (#33230)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/33203
PR based on https://github.com/mpark/variant/pull/73

Verified locally on CUDA10.2.89 and 10.1.243

Thanks ngimel for the hint and gridley for the initial fix in the variant repo! :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33230

Differential Revision: D19858083

Pulled By: ngimel

fbshipit-source-id: b9438084f5688712c6aa6b17813c68ccde237bbb
2020-02-12 14:23:30 -08:00
40265e2d66 prevent various warnings related to undef and redef (#33196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33196

Test Plan: Sandcastle green

Reviewed By: malfet

Differential Revision: D19842268

fbshipit-source-id: 47bc3d7a75e803041491e11a648b4a9e7d9cc72c
2020-02-12 13:28:35 -08:00
323b0e0a0f fix #30480 torch.normal shape checking is broken (#32243) (#33050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33050

Following what gchanan proposed in #30480 (a quick sketch of the resulting behavior follows this list):
- If the (logical) shapes of mean and std are broadcastable, we broadcast them for the output
  Done in tensor iterator already.
- If the (logical) shapes of mean and std are not broadcastable and they have the same number of elements, we fall back to the old behavior (pick the shape of mean)
  Done by reshaping std to the same shape as mean.
- If the (logical) shapes of mean and std are not broadcastable and don't have the same number of elements, we error out.
  Done by tensor iterator already.
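
A quick sketch of the resulting behavior:

```python
import torch

mean = torch.zeros(2, 3)
torch.normal(mean, torch.ones(3))   # broadcastable -> output shape (2, 3)
torch.normal(mean, torch.ones(6))   # same numel, not broadcastable
                                    # -> falls back to mean's shape (2, 3)
torch.normal(mean, torch.ones(5))   # different numel -> errors out
```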

Test Plan: Imported from OSS

Differential Revision: D19771186

Pulled By: glaringlee

fbshipit-source-id: a0b71063c7f5fdda2d4ceb84e06384414d7b4262
2020-02-12 12:43:09 -08:00
2e9b7c5fe1 Migrate dist from TH to ATen(CPU, CUDA) (#29714)
Summary:
[https://github.com/pytorch/pytorch/issues/24691](https://github.com/pytorch/pytorch/issues/24691)
[https://github.com/pytorch/pytorch/issues/24551](https://github.com/pytorch/pytorch/issues/24551)

Benchmark:

**Speed**
```python
import time, sys
import torch
import math

inf = math.inf

torch.manual_seed(0)
devices = ["cpu", "cuda"]
ps = [0, 1, 2, 3, 4, inf, -inf]

# Warm up
for device in devices:
    for n in [1, 10, 100, 1000]:
        x = torch.randn(100, n, requires_grad=False, device=device)
        y = torch.randn(100, n, requires_grad=False, device=device)
        for i in range(1000):
            for p in ps:
                dist_xy = torch.dist(x, y, p)

for device in devices:
    print('On {}'.format(device))
    for n in [1, 10, 100, 1000]:
        total_time = 0
        x = torch.randn(100, n, requires_grad=False, device=device)
        y = torch.randn(100, n, requires_grad=False, device=device)
        for i in range(10000):
            for p in ps:
                t1 = time.time()
                dist_xy = torch.dist(x, y, p)
                t2 = time.time()
                total_time += (t2 - t1)
        average_time = total_time / 10000 / len(ps) * 1000
        print("input size(100, %d) average time is %.8f (ms)." % (n, average_time))
```

Output
Before:
```shell
On cpu
input size(100, 1) average time is 0.0079491 (ms).
input size(100, 10) average time is 0.0364167 (ms).
input size(100, 100) average time is 0.3120752 (ms).
input size(100, 1000) average time is 3.0605820 (ms).
On cuda
input size(100, 1) average time is 0.04745627 (ms).
input size(100, 10) average time is 0.04919453 (ms).
input size(100, 100) average time is 0.06601572 (ms).
input size(100, 1000) average time is 0.07849015 (ms).
```

After:
```shell
On cpu
input size(100, 1) average time is 0.0099936 (ms).
input size(100, 10) average time is 0.0340414 (ms).
input size(100, 100) average time is 0.2793379 (ms).
input size(100, 1000) average time is 0.7858076 (ms).
On cuda
input size(100, 1) average time is 0.04410237 (ms).
input size(100, 10) average time is 0.03326339 (ms).
input size(100, 100) average time is 0.03314828 (ms).
input size(100, 1000) average time is 0.03990038 (ms).
```

**Precision**

```python
for device in devices:
    torch.manual_seed(0)
    print('On {}'.format(device))
    for n in [1, 10, 100, 1000]:
        x = torch.randn(100, n, requires_grad=False).to(device)
        y = torch.randn(100, n, requires_grad=False).to(device)
        for p in ps:
            dist_xy_float = torch.dist(x, y, p)
            dist_xy_double = torch.dist(x.double(), y.double(), p)
            difference = torch.abs(dist_xy_double - dist_xy_float)
            print('input size (100, {}), p: {}, float: {}, double: {}, difference: {}'.format(n, p, dist_xy_float, dist_xy_double, difference))
```

Part of [output](https://gist.github.com/rivergold/dd95014dc7f163b22f72699d1134cdd2)
Before:
```shell
On cpu
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.1806640625, double: 11474.185433543797, difference: 0.00476948129653465
input size (100, 100), p: 2, float: 143.50729370117188, double: 143.5073391487937, difference: 4.5447621829453055e-05
input size (100, 100), p: 3, float: 36.045475006103516, double: 36.04550275212738, difference: 2.774602386779179e-05
input size (100, 100), p: 4, float: 18.796083450317383, double: 18.79609807865317, difference: 1.4628335787136848e-05
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
On cuda
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.1865234375, double: 11474.185433543797, difference: 0.00108989370346535
input size (100, 100), p: 2, float: 143.50733947753906, double: 143.5073391487933, difference: 3.2874575595087663e-07
input size (100, 100), p: 3, float: 36.04550552368164, double: 36.045502752127405, difference: 2.7715542358919265e-06
input size (100, 100), p: 4, float: 18.796098709106445, double: 18.796098078653177, difference: 6.304532682577246e-07
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
```

After
```shell
On cpu
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.1806640625, double: 11474.185433543797, difference: 0.00476948129653465
input size (100, 100), p: 2, float: 143.50729370117188, double: 143.5073391487937, difference: 4.5447621829453055e-05
input size (100, 100), p: 3, float: 36.045475006103516, double: 36.04550275212738, difference: 2.774602386779179e-05
input size (100, 100), p: 4, float: 18.796083450317383, double: 18.79609807865317, difference: 1.4628335787136848e-05
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
On cuda
input size (100, 100), p: 0, float: 10000.0, double: 10000.0, difference: 0.0
input size (100, 100), p: 1, float: 11474.185546875, double: 11474.185433543797, difference: 0.00011333120346534997
input size (100, 100), p: 2, float: 143.50733947753906, double: 143.5073391487933, difference: 3.2874575595087663e-07
input size (100, 100), p: 3, float: 36.04550552368164, double: 36.045502752127405, difference: 2.7715542358919265e-06
input size (100, 100), p: 4, float: 18.796096801757812, double: 18.796098078653177, difference: 1.2768953645547754e-06
input size (100, 100), p: inf, float: 5.540258407592773, double: 5.5402586460113525, difference: 2.384185791015625e-07
input size (100, 100), p: -inf, float: 3.4868717193603516e-06, double: 3.4868717193603516e-06, difference: 0.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29714

Differential Revision: D19769518

Pulled By: albanD

fbshipit-source-id: 69b79b64f1f190b410efe884662b6601e903eccf
2020-02-12 12:26:48 -08:00
97bf41ca22 Fix iOS x86_64 CI failure (#33194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33194

### Summary

The iOS x86_64 job has been failing for a few days. I haven't found the root cause, but it seems that updating torchvision to its latest version fixes the problem

### Test Plan

- the x86_64 job works

Test Plan: Imported from OSS

Differential Revision: D19845079

Pulled By: xta0

fbshipit-source-id: 5034e252600b6704b860d68c371a65bef4cf37fc
2020-02-12 11:07:48 -08:00
87640570b3 Make CUDA OOM error a type (#33056)
Summary:
There are cases where we want to recover from CUDA OOM; for example, some cuDNN algorithms use a huge workspace, and on OOM we want to fall back to a different algorithm. In such cases there is no reason to catch all errors.
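For context, a hedged sketch of the recovery pattern being enabled. The PR itself adds a dedicated C++ error type; the message inspection below is the pre-existing Python-level workaround, shown only to illustrate the use case (requires a CUDA device):

```python
import torch

def alloc_with_fallback(shape):
    try:
        return torch.empty(shape, device="cuda")
    except RuntimeError as e:        # pre-existing workaround: inspect the message
        if "out of memory" not in str(e):
            raise                    # not an OOM, a real error: re-raise it
        torch.cuda.empty_cache()     # recover, then retry with a smaller request
        return torch.empty([s // 2 for s in shape], device="cuda")
```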
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33056

Differential Revision: D19795359

Pulled By: ezyang

fbshipit-source-id: a34e23bf6d172dc0257389251dafef5b38d27d2b
2020-02-12 10:45:40 -08:00
a389f8fa18 Revert D18912680: Prepare templates
Test Plan: revert-hammer

Differential Revision:
D18912680

Original commit changeset: 9e3828e42ee5

fbshipit-source-id: 9ef81991394f4e36f0652dfe594d5122969bd9cf
2020-02-12 10:39:09 -08:00
3cfea39968 Document how BCELoss avoids infinite results (#33160)
Summary:
Issue https://github.com/pytorch/pytorch/issues/31453
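For reference, the documented behavior: BCELoss clamps its log terms to a minimum of -100, so a prediction of exactly 0 or 1 yields a large finite loss rather than infinity:

```python
import torch

loss = torch.nn.BCELoss()
pred = torch.tensor([0.0])     # log(0) would be -inf...
target = torch.tensor([1.0])
print(loss(pred, target))      # tensor(100.): the log term is clamped to >= -100
```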
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33160

Differential Revision: D19835527

Pulled By: albanD

fbshipit-source-id: 82fd2dd46ffbc87e90ca8e100db411b6ff6bfe32
2020-02-12 07:56:19 -08:00
05281a5671 Add nice error message if missing overrides in custom autograd.Function
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33142

Test Plan: Imported from OSS

Differential Revision: D19815786

Pulled By: albanD

fbshipit-source-id: 5513d900c7b711b625383686fcf03f822ab7ea80
2020-02-12 07:55:06 -08:00
09915ad570 [TensorBoard] Correct typo and wrap dataformats. (#31604)
Summary:
Resolves issue https://github.com/pytorch/pytorch/issues/31603

- A minor spelling typo is corrected: "suitible" --> "suitable"
- A minor quality of life improvement is added: the data format strings are better rendered as fixed width to indicate that they are string constants.  "CHW" --> "`CHW`"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31604

Differential Revision: D19697293

Pulled By: ezyang

fbshipit-source-id: ee38b0d4c9ca8a233ac9243c310d9a3b42ad6f32
2020-02-12 07:51:04 -08:00
c6e0360812 Minor change of docstring example of WeightedRandomSampler (#30846)
Summary:
Previous example
```python
>>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))
        [0, 0, 0, 1, 0]
```
may seem misleading given the provided weights.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30846

Differential Revision: D19697367

Pulled By: ezyang

fbshipit-source-id: 3d6e3cd0cecb5272a368707ba35bc7acdbd82c30
2020-02-12 07:46:39 -08:00
1767ae8daf [caffe2] remove dnnlowp log code (#33184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33184

dnnlowp specific code shouldn't be in the default FC in the first place

Test Plan: Just removing #ifdef #endif

Reviewed By: jianyuh

Differential Revision: D19835301

fbshipit-source-id: 7880cf298bedb3f0bc407d140d342124663ea4a7
2020-02-12 00:47:09 -08:00
9d9fa2eace [2/3] Bind Bucketize to PyTorch (#33014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33014

Export Bucketize to PyTorch.

Test Plan: buck test caffe2/caffe2/python/operator_test:torch_integration_test

Reviewed By: bddppq

Differential Revision: D19737534

fbshipit-source-id: be1c892bb8d01da9892f221f150f1a2788ac732e
2020-02-11 23:20:10 -08:00
47e589eb6e Disable flaky tests test_DistributedDataParallel and test_backend_group for ROCm (#33211)
Summary:
Getting intermittent error in CI runs:

**TestDistBackend.test_DistributedDataParallel**
```
02:36:32   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/serialization.py", line 442, in _legacy_save
02:36:32     pickler.dump(obj)
02:36:32 AttributeError: Can't pickle local object 'Module._replicate_for_data_parallel.<locals>.zero_grad'
```
Some CI runs where it failed:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/16163/console
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/16165/console

**TestDistBackend.test_backend_group**
```
test_backend_group (__main__.TestDistBackend) ... Memory access fault by GPU node-5 (Agent handle: 0x265c670) on address 0x7fded754a000. Reason: Page not present or supervisor privilege.
```
Some CI runs where it failed:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/16288/console
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33211

Differential Revision: D19849089

Pulled By: bddppq

fbshipit-source-id: 5e997653cc344f4c6819d46bedc6d3bd75b5d854
2020-02-11 22:50:03 -08:00
5bc5dd58f3 [jit] fix a typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29107

Differential Revision: D19698662

Pulled By: ezyang

fbshipit-source-id: e7eea3246008e2c6d560ff5e4d84b90f65ff1afd
2020-02-11 22:45:28 -08:00
b9a5353fee Move where cuda implementation to TensorIterator (#33228)
Summary:
Reopen of https://github.com/pytorch/pytorch/pull/32984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33228

Differential Revision: D19850862

Pulled By: ngimel

fbshipit-source-id: b92446a49b4980188fa4788220a2164650e905c2
2020-02-11 22:28:27 -08:00
7863d2413d Updating submodules
Summary:
GitHub commits:

9fd0d1a3c7
bcaf9cdf1f
3e49249d30
98307ea1ec
f48ebb4d48
353f9c9f29
1caef25fc0
805ab665f2

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 609187c69ba2c6b31a05dcfdb1770054002ddb6e
2020-02-11 22:00:54 -08:00
d609497dde bulk_eval_collect_histograms
Summary:
Collect activation histograms during model evaluation and aggregate all the histograms from multiple threads/readers into one file.
The original functionality of the bulk_eval workflow is still valid. The output predictions and extra blobs will be exported to a Hive table, which is very useful for numerical debugging.

Test Plan:
FBL
```flow-cli canary dper.workflows.bulk_eval.export --mode dbg --parameters-file experimental/summerdeng/sparsenn/bulk_eval_input_configs.json  --run-as-secure-group team_ai_system_sw_hw_co-design --entitlement gpu_prod --name "Histogram collection with caffe2 logging. Attach histogram observer to the predict net. Use small model 102343030. "
```
f163861773

When the flow is done, we can get all the histogram files under the specified dir. For example:
```
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6ca65cc0
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6cde8a80
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6d144840
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6d4a9600
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6da303c0
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6dd1c800
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6e0855c0
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6e3e0380
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6e95a140
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6eafcf00
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6ed1a100
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6f094ec0
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6f561c80
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6f783a40
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb6fccb7c0
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb7003d580
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb703ae340
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb7084ae80
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb70bc1c40
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb70f43a00
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb70ff7680
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb71361300
-rw-rw-r--. 1 185754 185754 3945012 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb716df0c0
-rw-rw-r--. 1 185754 185754 4024538 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb7199c780
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb71b72f00
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb72330000
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb72598100
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb7290d880
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb72b03980
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb72f1f160
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fcb8bcee9e0
-rw-rw-r--. 1 185754 185754 3944091 Jan 23 09:45 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.0x7fd51b457260
-rw-rw-r--. 1 185754 185754 4026659 Jan 23 09:51 /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.final
```

The aggregated histogram file is  /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.final. It can be loaded to the following auto quant workflow for int8 static quantization.

######## Code refactoring ########

Moved the utility functions for processing activation histograms to deeplearning/numeric_suite/toolkit:hist_processor and added the dependency in dper.

We also had a hist_compiler in caffe2/caffe2/fb/fbgemm/numerical_debugger/python_utils/hist_compiler.py; it was likewise refactored to reuse the utility functions in deeplearning/numeric_suite/toolkit:hist_processor.

The histograms from bulk_eval and the hist_compiler are identical.
/mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.compiled.bak
/mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/sparsenn/bulk_eval.txt.final.bak

Reviewed By: hx89

Differential Revision: D19270090

fbshipit-source-id: c7ecb4f2bbf1ea725c52e903356ad9a7b9ad73ac
2020-02-11 21:39:47 -08:00
9e7638f7c1 "batchSize" was set but never used (#32294)
Summary:
fixes a compiler warning:
```
torch/aten/src/ATen/native/cuda/MaxUnpooling.cu.cc(402):
warning: variable "batchSize" was set but never used
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32294

Differential Revision: D19697277

Pulled By: ezyang

fbshipit-source-id: b9821be325826dc4785cad7994803b54f1711a0c
2020-02-11 21:28:49 -08:00
66ee4f1c81 [ROCm] Enable Bfloat16 type for activation and batch-norm
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32065

Differential Revision: D19728858

Pulled By: ezyang

fbshipit-source-id: 8f828c558bfe6c5f43f476ff8a0f967341f8f351
2020-02-11 21:04:20 -08:00
f255b7a3ac Drop support of the build option USE_GLOO_IBVERBS (#33163)
Summary:
Two releases have passed since its deprecation:
8a026d4f74b71944ac2860c315996165a40f5626
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33163

Differential Revision: D19850713

Pulled By: ezyang

fbshipit-source-id: 30a60df470b88e8c40e33112296e437cde29c49f
2020-02-11 20:35:50 -08:00
1487137c5b add missing default value for LRScheduler.step() (#32411)
Summary:
see also other type errors in https://github.com/pytorch/pytorch/pull/30576 and https://github.com/pytorch/pytorch/pull/30441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32411

Differential Revision: D19697245

Pulled By: ezyang

fbshipit-source-id: d0295d747541adec5d6fad646f4cf4bb2f04abf5
2020-02-11 20:34:33 -08:00
139afd0ea7 Fix link to py-spy content in contribution guide TOC (#31760)
Summary:
The extra dashes are breaking the link here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31760

Differential Revision: D19697301

Pulled By: ezyang

fbshipit-source-id: 65de026b9016dc8689c9dac9efb8aafd00b535cd
2020-02-11 20:27:35 -08:00
74c8a8f7bc Revert D19825127: [pytorch][PR] Move where cuda implementation to TensorIterator
Test Plan: revert-hammer

Differential Revision:
D19825127

Original commit changeset: bbf4682349d9

fbshipit-source-id: 0c439b8c9a00a5aa46fd196396cf7cc83cddb1b4
2020-02-11 19:49:18 -08:00
000a5e2b7f bad tbb lambda capture, bad chunk size (#30352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30352

1) tbb forwards us `ident` as a parameter, so we don't need to capture it.
2) tbb is being passed steps <= 0, which is invalid.

Taken from TBB documentation:
```
The index type must be an integral type. The loop must not wrap around. The step value must be positive. If omitted, it is implicitly 1.
```

I have a build that uses `TBB_USE_DEBUG=1`, and it currently surfaces a lot of issues with PyTorch's usage of TBB.
Is the TBB build not tested very much right now?
ghstack-source-id: 94459382

Test Plan: CI green

Differential Revision: D18666029

fbshipit-source-id: d5aa8327b03181d349e1964f9c8211298c433d6a
2020-02-11 18:46:32 -08:00
a23009f98f Quantized leaky relu
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33004

Test Plan: Imported from OSS

Differential Revision: D19740193

Pulled By: z-a-f

fbshipit-source-id: 32542d5465db44190366a2f8b737305a03b5fa76
2020-02-11 17:56:02 -08:00
769abddfa3 Build ahead-of-time C++ extensions with ninja on windows
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33084

Differential Revision: D19817361

Pulled By: ezyang

fbshipit-source-id: 95a6d0ffa9beb6885c8a41688621b33da51706ae
2020-02-11 17:50:09 -08:00
acd51e13f7 TorchScript add check if quantized
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32890

Test Plan: Imported from OSS

Differential Revision: D19673463

Pulled By: z-a-f

fbshipit-source-id: 453ff662810845fcaeb8e6d5919afa8e2d395768
2020-02-11 17:38:49 -08:00
cb39a5400c Use C10_WARP_SIZE to fix functionality on HIP vs CUDA for batch_norm_backward_reduce (#33098)
Summary:
1. Use C10_WARP_SIZE instead of the hardcoded value "32".
2. `getNumThreads` returns a minimum of 32 for CUDA, which equals the warp size in CUDA. However, for HIP it returns a minimum of 16, which is less than the warp size (64) in HIP. This creates an issue in the [reduce function](14548c2d5b/aten/src/ATen/native/cuda/Normalization.cuh (L115)) when it zeroes out the other entries in shared memory [here](14548c2d5b/aten/src/ATen/native/cuda/Normalization.cuh (L137)): since `blockDim.x` is at least the warp size in CUDA, `shared[0]` is never zeroed out there. On HIP, however, `blockDim.x` can be 16 or 32, so `blockDim.x * blockDim.y` can be smaller than the warp size for small cases, which then zeroes out `shared[0]` as well. This makes the reduce function erroneously return zero on ROCm, depending on how the block dimensions are set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33098

Differential Revision: D19837355

Pulled By: bddppq

fbshipit-source-id: ea526acd82ec08b1acb25be860b7e663c38ff173
2020-02-11 16:47:22 -08:00
44723a1c24 [ONNX] Fix ONNX CI (#33200)
Summary:
Move the data to aws
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33200

Reviewed By: hl475

Differential Revision: D19843193

Pulled By: houseroad

fbshipit-source-id: bb0451d211cfc951ddb66264b92586c43b6e8841
2020-02-11 16:38:26 -08:00
af4d6120bd Temporarily disable failing 'binary_macos_libtorch_2_7_cpu_build' and… (#33207)
Summary:
… 'binary_macos_wheel_3_6_cpu_build' jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33207

Differential Revision: D19844787

Pulled By: kostmo

fbshipit-source-id: d44a0e26bf76afe4a5f94d7f1ad2d558de6f5d47
2020-02-11 15:44:35 -08:00
04829e924a Update CPU threading doc (#33083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33083

Added more recommendations, some notes, and warnings

Test Plan: cd docs ; make html

Differential Revision: D19829133

Pulled By: ilia-cher

fbshipit-source-id: b9fbd89f5875b3ce35cc42ba75a3b44bb132c506
2020-02-11 14:13:51 -08:00
6706c3f457 Prepare templates (#30982)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30982

This stack is a first step toward an effort to fix, clean up and simplify code generation logic. Please see the master [task](https://github.com/pytorch/pytorch/issues/30405) for related discussions and all the known issues.

Main focus of these changes is TensorOptions in code generation.
Goals:
- Remove TensorOptions from generated code wherever it's possible. Leave it only in python/C++ API layers.
- Refactor TensorOptions logic to a single place.
- Log all discovered issues.

Non goals:
- Fix Everything!
- Remove all the hacks in code generation scripts.
- Clean up and refactor all code generation scripts.

-----------
In this PR:
Updating the templates.

-----------

Test Plan: Imported from OSS

Differential Revision: D18912680

Pulled By: izdeby

fbshipit-source-id: 9e3828e42ee5c3aefbf3729f4a8d6db813f2e7c3
2020-02-11 13:10:14 -08:00
45818a3de4 Remove some Half support in some binary CPU kernels (#33021)
Summary:
They were probably added by mistake: we do not intend to support Half
on CPU in general, and in these situations the Half type would likely be
significantly slower than its float and double counterparts due to the
lack of vectorization and the need for additional casting.

cc XiaobingSuper
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33021

Differential Revision: D19795152

Pulled By: VitalyFedyunin

fbshipit-source-id: b19796db88880a46557e1b2fd06e584d46093562
2020-02-11 12:54:47 -08:00
7b50e76255 optimize cat performance on CPU with TensorIterator (#30806)
Summary:
This PR aims at improving `cat` performance on CPU.
The current `cat` logic from the `TH` module has no parallelization when the input tensors are all contiguous.
This code also tries to reuse the same `TensorIterator` as much as possible, in order to reduce the overhead of creating a `TensorIterator`; this helps when each copied slice is not large.
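A minimal timing sketch for the all-contiguous case this PR targets (assumed shapes and counts, not the benchmark used in the PR):

```python
import timeit
import torch

tensors = [torch.randn(64, 1024) for _ in range(100)]  # all contiguous inputs
print(timeit.timeit(lambda: torch.cat(tensors, dim=0), number=1000), "s")
```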
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30806

Differential Revision: D19275026

Pulled By: VitalyFedyunin

fbshipit-source-id: 756e9b86891f725c256b0a6981887ff06d88b053
2020-02-11 12:49:56 -08:00
ad90c97c0a Removes flaky check (#33146)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/32949.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33146

Differential Revision: D19836001

Pulled By: mruberry

fbshipit-source-id: 773069ae0c181e1a050b65b888c87590c1dddb32
2020-02-11 12:21:07 -08:00
a64d0ffe81 Use int64 in pdist kernel to handle batches >= 46342 #30583 (#31593)
Summary:
Currently `torch.pdist` yields an illegal CUDA memory access for batch sizes >= 46342 as reported by SsnL in https://github.com/pytorch/pytorch/issues/30583.
Thanks for the minimal code reproduction, btw! ;)

Reason for this bug:
The calculation of `i` in [`pdist_kernel_cuda_impl`](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L112)) can overflow if a tensor with a batch size >= 46342 is passed to `torch.pdist`.

Detailed description:
* `result` is resized to `n * (n - 1) / 2 = 1073767311` elements ([line of code](46ad80c839/aten/src/ATen/native/Distance.cpp (L140)))
* `grid` is initialized as `result.numel()` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L246)))
* `k` is assigned to the `blockIdx.x` as an `int32` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L108)))
* `i` is calculated using `2 * k >= 2147534622` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L112))), which overflows, since `2147534622 > 2147483647 (int32_max)`.

Using `const int64_t k = blockIdx.x;` would solve the illegal memory access. This seems also be done for [`cdist_kernel_cuda_impl`](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L198-L201)).
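The overflow threshold can be verified with plain Python arithmetic (a sketch, not kernel code):

```python
INT32_MAX = 2**31 - 1            # 2147483647

n = 46342                        # smallest failing batch size
numel = n * (n - 1) // 2         # 1073767311 result elements == CUDA grid size
k_max = numel - 1                # largest blockIdx.x
print(2 * k_max > INT32_MAX)     # True -> `2 * k` overflows the int32 `i` computation

n = 46341                        # largest working batch size
print(2 * (n * (n - 1) // 2 - 1) > INT32_MAX)  # False
```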

However, we might expect a slowdown, so I've timed the current PyTorch master vs. this PR:
(tested with `x = torch.randn(x.size(0), 128)` on a V100)

 |x.size(0) | int32 idx | int64 idx | slowdown |
 |----------|-----------|-----------|----------|
| 50000 | -              | 4.4460 | - |
| 25000 | 1.02522 | 1.10869 | 7.53% |
| 12500 | 0.25182 | 0.27277 | 7.68% |
| 6250 | 0.06291 | 0.06817 | 7.72% |
| 3125 | 0.01573 | 0.01704 | 7.69% |
| 1562 | 0.00393 | 0.00426 | 7.75% |

While checking the backward kernel, it seems I'm triggering another error with a size limit of
```python
x = torch.randn(1449, 1, device='cuda', requires_grad=True)
out = torch.pdist(x)
out.mean().backward()
> RuntimeError: CUDA error: invalid configuration argument
```
, while `[<=1448, 1]` works.

I'll take another look at this issue. Let me know, if the potential fix should go into this PR or if I should open a new issue.

CC ngimel, csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31593

Differential Revision: D19825571

Pulled By: ngimel

fbshipit-source-id: ace9ccab49f3cf0ce894cdb6daef0795e2e8ec03
2020-02-11 12:00:39 -08:00
367488b001 Move where cuda implementation to TensorIterator (#32984)
Summary:
`where` is special because its arguments do not all have the same type, which violates the assumption made by the modern code path in https://github.com/pytorch/pytorch/pull/32383. I am migrating it to TensorIterator so that there is something to test that this case is not broken. Currently this case falls back to the legacy (not vectorized, not unrolled) code; it should be supported properly when I clean up `Loops.cuh`.

I also move some shared parts of `CUDALoops.cuh` and `ROCmLoops.cuh` into `Loops.cuh` so that the logic for checking whether `func_t` has the same argument types can be shared.
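The mixed argument types are visible from the Python API: the condition is boolean while the branches are floating point. A minimal illustration:

```python
import torch

cond = torch.tensor([True, False, True])  # bool condition
x = torch.randn(3)                        # float branches
y = torch.zeros(3)
print(torch.where(cond, x, y))            # elementwise select across mixed dtypes
```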
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32984

Differential Revision: D19825127

Pulled By: ngimel

fbshipit-source-id: bbf4682349d96b4480c4d657f3c18a3a67a9bf17
2020-02-11 11:10:06 -08:00
31370949be Add zero_mask function for vectorized functions. (#32985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32985

This can be useful in many situations to decide whether all elements are
zeros or non-zeros, such as elu as shown in #32986 .

Test Plan: Imported from OSS

Differential Revision: D19794549

Pulled By: VitalyFedyunin

fbshipit-source-id: 1be1c863d69b9a19fdcfcdd7cb52343066f740d3
2020-02-11 11:01:29 -08:00
855ee6446f Revert D18749922: [pytorch] Migrating index_add cuda to ATen
Test Plan: revert-hammer

Differential Revision:
D18749922

Original commit changeset: d243be43a3b6

fbshipit-source-id: 15dafa644d84ff8803bd9ab3cdd40e12d805924a
2020-02-11 10:33:20 -08:00
857bae39e0 Updated DispatchKeyExtractor to expect TensorOptions (#30981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30981

This stack is a first step toward an effort to fix, clean up and simplify code generation logic. Please see the master [task](https://github.com/pytorch/pytorch/issues/30405) for related discussions and all the known issues.

Main focus of these changes is TensorOptions in code generation.
Goals:
- Remove TensorOptions from generated code wherever it's possible. Leave it only in python/C++ API layers.
- Refactor TensorOptions logic to a single place.
- Log all discovered issues.

Non goals:
- Fix Everything!
- Remove all the hacks in code generation scripts.
- Clean up and refactor all code generation scripts.

-----------
In this PR:
Extended DispatchKeyExtractor logic to expect TensorOptions.

-----------

Test Plan: Imported from OSS

Differential Revision: D18912684

Pulled By: izdeby

fbshipit-source-id: 25cf1c397caa14272ca65b4003f1f03ff282ea77
2020-02-11 10:09:08 -08:00
e7f0b15473 Remove return value for __exit__ (#32997)
Summary:
When an error is raised and `__exit__` in a context manager returns `True`, the error is suppressed; otherwise it is raised. No return value should be given, so as to keep the default behavior of a context manager.
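A minimal illustration of the rule, in plain Python (unrelated to the scheduler code below):

```python
class Suppress:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return True  # truthy return value -> the exception is swallowed

with Suppress():
    raise TypeError("never propagates")
print("still running")  # reached; returning None (the default) would re-raise
```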

Fixes https://github.com/pytorch/pytorch/issues/32639. The `get_lr` function was overridden with a function taking an epoch parameter, which is not allowed. However, the relevant error was not being raised.

```python
In [1]: import torch
   ...:
   ...: class MultiStepLR(torch.optim.lr_scheduler._LRScheduler):
   ...:     def __init__(self, optimizer, gamma, milestones, last_epoch = -1):
   ...:         self.init_lr = [group['lr'] for group in optimizer.param_groups]
   ...:         self.gamma = gamma
   ...:         self.milestones = milestones
   ...:         super().__init__(optimizer, last_epoch)
   ...:
   ...:     def get_lr(self, step):
   ...:         global_step = self.last_epoch #iteration number in pytorch
   ...:         gamma_power = ([0] + [i + 1 for i, m in enumerate(self.milestones) if global_step >= m])[-1]
   ...:         return [init_lr * (self.gamma ** gamma_power) for init_lr in self.init_lr]
   ...:
   ...: optimizer = torch.optim.SGD([torch.rand(1)], lr = 1)
   ...: scheduler = MultiStepLR(optimizer, gamma = 1, milestones = [10, 20])
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-7fad6ba050b0> in <module>
     14
     15 optimizer = torch.optim.SGD([torch.rand(1)], lr = 1)
---> 16 scheduler = MultiStepLR(optimizer, gamma = 1, milestones = [10, 20])

<ipython-input-1-7fad6ba050b0> in __init__(self, optimizer, gamma, milestones, last_epoch)
      6         self.gamma = gamma
      7         self.milestones = milestones
----> 8         super().__init__(optimizer, last_epoch)
      9
     10     def get_lr(self, step):

~/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/optim/lr_scheduler.py in __init__(self, optimizer, last_epoch)
     75         self._step_count = 0
     76
---> 77         self.step()
     78
     79     def state_dict(self):

~/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/optim/lr_scheduler.py in step(self, epoch)
    141                 print("1a")
    142                 # try:
--> 143                 values = self.get_lr()
    144                 # except TypeError:
    145                     # raise RuntimeError

TypeError: get_lr() missing 1 required positional argument: 'step'
```

May be related to https://github.com/pytorch/pytorch/issues/32898.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32997

Differential Revision: D19737731

Pulled By: vincentqb

fbshipit-source-id: 5cf84beada69b91f91e36b20c3278e9920343655
2020-02-11 09:27:29 -08:00
6c0dc66cb4 [caffe2] use JIT'ed fp32 SLS (#33123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32413

Use JIT'ed fp32 SLS in Caffe2 operators

Test Plan:
```
./fblearner/flow/run_integration_tests --regex dper.workflows.canary.canary_workflow --wait
```
f167043951 was killed due to 3hr timeout instead of failed.

Reviewed By: jianyuh

Differential Revision: D19680711

fbshipit-source-id: efaca333edcfeab0007ad88f4f5168b2229e7e66
2020-02-11 08:59:17 -08:00
3655975565 Add allow_rebase_history flag and fix codegen functions for multiple views (#32790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32790

Same as https://github.com/pytorch/pytorch/pull/31990 but without the first commit in the stack that is problematic for a lot of people.

Test Plan: Imported from OSS

Differential Revision: D19814116

Pulled By: albanD

fbshipit-source-id: d104911a5b098a5807b4bc08b69803ebd4f69fa6
2020-02-11 07:16:02 -08:00
330d051bd5 [pytorch] Migrating index_add cuda to ATen (#30573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30573

Mostly just moved code.
Index dim and number of indices checks are added to make checks idential to index_add_cpu_
ghstack-source-id: 98010129

Test Plan: existing tests

Differential Revision: D18749922

fbshipit-source-id: d243be43a3b6a9b9591caf0c35ef2fb6ec0d3ead
2020-02-11 06:03:53 -08:00
9857d9b4cd fix gather regression by not materializing loop vars in the error mes… (#33108)
Summary:
…sage

Per title, fixes regression reported in https://github.com/pytorch/pytorch/issues/32425. cc nikitaved
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33108

Differential Revision: D19816116

Pulled By: ngimel

fbshipit-source-id: 9f4a84c8e4533873b71bb7bbf3a7915b05308845
2020-02-10 18:27:02 -08:00
6f46962f21 [1/3] Bind IndexHash to PyTorch (#33015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33015

Export IndexHash to PyTorch

Test Plan:
buck test caffe2/caffe2/python/operator_test:torch_integration_test

      ✓ caffe2/caffe2/python/operator_test:torch_integration_test-2.7 - test_index_hash_op (caffe2.caffe2.python.operator_test.torch_integration_test.TorchIntegration) 0.151 44/50 (passed)

Reviewed By: bddppq

Differential Revision: D19727301

fbshipit-source-id: a65c954539e81a15577fe5c3c0deb3614e983534
2020-02-10 17:47:38 -08:00
61ac14a483 Updating submodules
Summary:
GitHub commits:

543b39c9ad
38c2e0ee44
552c07c32b
4369f2c7bb
07dbb5d2f4

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 803108a618a5be9ea58a38644c851486bad3bfbc
2020-02-10 17:19:07 -08:00
a3e69d3405 Use bazelisk instead of specifying bazel version manually. (#33036)
Summary:
Bazelisk automatically reads the `.bazelversion` file and installs the required version of Bazel. This saves us from updating the CI script every time we need a Bazel upgrade.
Use clang-8 for consistency with the pytorch/xla repo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33036

Differential Revision: D19820819

Pulled By: ailzhang

fbshipit-source-id: 1560ec225cd037a811769a509a704b0df77ea183
2020-02-10 17:14:08 -08:00
524fe8a96c Updating submodules
Summary:
GitHub commits:

4bc5213b66
9ae570bb89
b2bc1da561
dcde8696bd

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: c5ca30dab73f80cd13f5a5bf6e3867083b2512ac
2020-02-10 15:07:12 -08:00
d672779339 [CI][treehug] Disable xenial_py2.7 tests due to mypy min version py3.5
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33159

Test Plan: Imported from OSS

Differential Revision: D19822400

Pulled By: IvanKobzarev

fbshipit-source-id: 8e7b561e6a6181ec1f9b6f56a539ddcb538b3858
2020-02-10 14:52:29 -08:00
495c1df510 [pytorch] convert code analyzer to a binary (#33102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33102

Add a simple main() to build code analyzer as a binary. This enables
easier integration with FB internal build environment.
ghstack-source-id: 97958658

Test Plan: - CI

Differential Revision: D19798560

Pulled By: ljk53

fbshipit-source-id: 126230e3bf7568046a309e8a6785230f820e0222
2020-02-10 14:46:29 -08:00
e8c4f5a74b Temporarily disable failing iOS builds
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33154

Differential Revision: D19820655

Pulled By: kostmo

fbshipit-source-id: fc3e22b1bf4ec112085ea846c3999efd0f3e26f3
2020-02-10 13:47:57 -08:00
3bde97d5a5 Move a resize from codegen to code.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33024

Test Plan: Imported from OSS

Differential Revision: D19774147

Pulled By: gchanan

fbshipit-source-id: 08cb099f1695b28117e4236e214976b548aec7a1
2020-02-10 12:47:14 -08:00
3c4cec56aa Enable test_distributed for ROCm but only with nccl backend [REDUX] (#32551)
Summary:
This is a redux of the original PR https://github.com/pytorch/pytorch/issues/28814, which was reverted in PR https://github.com/pytorch/pytorch/issues/29736 because test_DistributedDataParallel was suspected of being flaky. Further investigation revealed it wasn't flakiness but a bug in the PyTorch source code, which has now been fixed in PR https://github.com/pytorch/pytorch/issues/32356. This PR is another attempt at enabling the test_distributed unit test suite, only for the nccl backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32551

Differential Revision: D19729966

Pulled By: bddppq

fbshipit-source-id: 12a0d850991a903cc7723d63693b6157071d7115
2020-02-10 12:42:36 -08:00
f4fbe9549d Revert D19800021: [pytorch][PR] Improve error message for assertWarnsRegex
Test Plan: revert-hammer

Differential Revision:
D19800021

Original commit changeset: 1c31ae785c8f

fbshipit-source-id: d7b340d678562c25a84d48be66c576075000b50d
2020-02-10 12:17:52 -08:00
6be4ec100f [pytorch] Elide more Thrift Tensor send copies. (#31998)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31998

This change builds on recent torch::from_blob() changes to avoid Tensor
copies on send in more cases.

In particular, this change adds an option which, when enabled, assumes that if the
Tensor Storage's DataPtr has a non-trivial deleter, then the Tensor does in fact
manage the underlying memory. Hence we can reference the Tensor's Storage
via an IOBuf that stays referenced while sending, saving a Tensor copy.

We add appropriate test cases, particularly for torch::from_blob(), which
would have been problematic without the recent changes.
ghstack-source-id: 97778619

Test Plan: buck test mode/dev caffe2/torch/fb/distributed/wireSerializer/test/...

Reviewed By: satgera

Differential Revision: D19306682

fbshipit-source-id: 05f56efb2d5d6279ae4b54dfcbba0f729c2c13fa
2020-02-10 11:34:33 -08:00
ebed008dd4 Correct /MP usage in MSVC (#33120)
Summary:
## Several flags
`/MP[M]`: It is a flag for the compiler `cl`. It leads to object-level multiprocessing. By default, it spawns M processes where M is the number of cores on the PC.
`/maxcpucount:[M]`: It is a flag for the generator `msbuild`. It leads to project-level multiprocessing. By default, it spawns M processes where M is the number of cores on the PC.
`/p:CL_MPCount=[M]`: It is a flag for the generator `msbuild`. It leads the generator to pass `/MP[M]` to the compiler.
`/j[M]`: It is a flag for the generator `ninja`. It leads to object-level multiprocessing. By default, it spawns M processes where M is the number of cores on the PC.

## Reason for the change
1. Object-level multiprocessing is preferred over project-level multiprocessing.
2. ~For ninja, we don't need to set `/MP`, otherwise M * M processes will be spawned.~ Actually, this is not correct, because in the ninja configs there is only one source file per command. Therefore, the `/MP` switch should have no effect.
3. For msbuild, if it is called through Python configuration scripts, then `/p:CL_MPCount=[M]` will be added, otherwise, we add `/MP` to `CMAKE_CXX_FLAGS`.
4. ~It may be a possible fix for https://github.com/pytorch/pytorch/issues/28271, https://github.com/pytorch/pytorch/issues/27463 and https://github.com/pytorch/pytorch/issues/25393, because `/MP` is also passed to `nvcc`.~ This is probably not true, because `/MP` should have no effect given there is only one source file per command.

## Reference
1. https://docs.microsoft.com/en-us/cpp/build/reference/mp-build-with-multiple-processes?view=vs-2019
2. https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows
3. https://blog.kitware.com/cmake-building-with-all-your-cores/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33120

Differential Revision: D19817227

Pulled By: ezyang

fbshipit-source-id: f8d01f835016971729c7a8d8a0d1cb8a8c2c6a5f
2020-02-10 11:29:25 -08:00
9d94f56ce0 Backward operation of torch.eig for real eigenvalues (#33090)
Summary:
Another pull request to follow up issue https://github.com/pytorch/pytorch/issues/32531.
Here I implemented the backward operation for `torch.eig` under the condition that all eigenvalues are real.

This pull request is independent of my other pull request https://github.com/pytorch/pytorch/issues/32932; there is no dependency between the two.
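A hedged usage sketch, assuming a symmetric input so that all eigenvalues are real, and using `torch.eig` as it existed at the time (it has since been deprecated in favor of `torch.linalg.eig`). The need for `eigenvectors=True` in the backward is an assumption here, not stated in the PR:

```python
import torch

a = torch.randn(3, 3, dtype=torch.double)
sym = (a + a.t()).requires_grad_()         # symmetric matrix -> real eigenvalues
w, v = torch.eig(sym, eigenvectors=True)   # eigenvectors assumed needed for backward
w.sum().backward()
print(sym.grad.shape)                      # torch.Size([3, 3])
```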
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33090

Differential Revision: D19814347

Pulled By: albanD

fbshipit-source-id: 2fae30964e97987abb690544df8240aedeae56e8
2020-02-10 09:52:56 -08:00
c917a247a8 Improve error message for assertWarnsRegex (#33099)
Summary:
`assertWarnsRegex` now prints out any warnings that it caught while failing to find a matching warning. This makes it easier to debug tests by just looking at the CI logs.
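For reference, the assertion pattern in question, shown with the standard-library `unittest` method of the same name (a sketch; the PR changes PyTorch's internal variant):

```python
import unittest
import warnings

class Example(unittest.TestCase):
    def test_warns(self):
        # fails -- and now lists the warnings actually caught -- if none matches
        with self.assertWarnsRegex(UserWarning, "deprecat"):
            warnings.warn("this API is deprecated", UserWarning)
```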
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33099

Differential Revision: D19800021

Pulled By: ezyang

fbshipit-source-id: 1c31ae785c8ffc5d47619aff6597e479263be2de
2020-02-10 07:27:59 -08:00
3e8d813263 Add more checks to custom Function (#33069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33069

This PR adds the following:
- Warn when a non-input Tensor is given to `mark_dirty()`, as it is not needed.
- Raise an error if we modify in place an input that is a view while the Function has multiple outputs. This case is not handled by `CopySlices` and would raise a cryptic error during the backward.
- Raise an error if an input is modified in place but not returned, as that prevents the graph rewrite from being done correctly (see the sketch below).
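A sketch of a custom Function that satisfies these checks: the input modified in place is both marked dirty and returned (hypothetical example, not from the PR):

```python
import torch

class InplaceDouble(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        x.mul_(2)          # modify the input in place...
        ctx.mark_dirty(x)  # ...declare it dirty (inputs only, per the new warning)...
        return x           # ...and return it, so the graph rewrite stays correct

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * 2

x = torch.randn(3, requires_grad=True).clone()  # non-leaf, safe to modify in place
y = InplaceDouble.apply(x)
```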

Test Plan: Imported from OSS

Differential Revision: D19791563

Pulled By: albanD

fbshipit-source-id: 4d8806c27290efe82ef2fe9c8c4dc2b26579abd1
2020-02-10 07:25:24 -08:00
e1c53a5c86 Fix version counter bump in cpp Function (#33068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33068

The version counter is already bumped if we use PyTorch's functions, but not if the user unpacks the Tensor and modifies it by hand or with a third-party library.
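The counter in question is observable from Python via the internal `_version` attribute (a minimal illustration):

```python
import torch

t = torch.zeros(3)
print(t._version)    # 0
t.add_(1)            # in-place ops through PyTorch bump the counter...
print(t._version)    # 1
t.numpy()[0] = 5.0   # ...but out-of-band writes like this one do not,
print(t._version)    # still 1 -- the situation this fix addresses for cpp Functions
```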

Test Plan: Imported from OSS

Differential Revision: D19791564

Pulled By: albanD

fbshipit-source-id: a73c0f73d8fd0c0e5bf838f14bed54fa66937840
2020-02-10 07:22:29 -08:00
efba630287 Issue a warning when zero_grad is used in DataParallel (#33064)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31768, second attempt of https://github.com/pytorch/pytorch/issues/32870

DataParallel creates replicas of the original `nn.Module` with the parameters duplicated onto the destination devices. Calling `backward` will propagate gradients onto the original module's parameters, but calling `zero_grad` on a replica module doesn't clear the gradients from the parent module. However, any replica calling `backward` was broken anyway, since the replica's parameters are not leaf nodes in autograd. So we should issue a warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33064

Differential Revision: D19790178

Pulled By: albanD

fbshipit-source-id: 886f36640acef4834a6fa57a26ce16b42ff0e9ad
2020-02-10 07:04:27 -08:00
e2f1288514 Add utils to inspect fp16/int8 packed weights (#32979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32979

Since we use prepacked weights in the Fp16 FCs and future Int8 FCs in production Ads models, we provide the python utils to inspect the unpacked format of the weights for debugging purpose. The main interfaces are the following:

```
from deeplearning.numeric_suite.toolkit import packed_weights_inspector
# inspect fp16 packed weights
unpacked_fp16_weights = packed_weights_inspector.extract_fp16_fc_packed_weights(fp16_weight_blob_name)

# inspect int8 packed weights
unpacked_int8_weights, qparams = packed_weights_inspector.extract_int8_fc_packed_weights(int8_weight_blob_name)
```

Test Plan:
```
buck test mode/opt deeplearning/numeric_suite/toolkit/test:packed_weights_inspector_test
```

Reviewed By: amylittleyang

Differential Revision: D19724474

fbshipit-source-id: e937672b3722e61bc44c2587aab2288a86aece9a
2020-02-08 18:18:56 -08:00
6249d7302b [ONNX] Fix export for avg_pool with default stride (#33017)
Summary:
If using nn.functional avg_pool, stride is an optional arg. If not provided, it is set to kernel_size.
This PR fixes the export of avg_pool with default stride.
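The default being fixed, in functional form (stride omitted, so it falls back to kernel_size, i.e. non-overlapping windows):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)
y = F.avg_pool2d(x, kernel_size=2)             # stride=None -> stride = kernel_size
print(y.shape)                                 # torch.Size([1, 1, 2, 2])
assert y.shape == F.avg_pool2d(x, 2, 2).shape  # explicit stride gives the same shape
```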
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33017

Reviewed By: hl475

Differential Revision: D19759604

Pulled By: houseroad

fbshipit-source-id: b0352db6fbaf427f4cff9ba8a942efdeb39b6f02
2020-02-07 22:46:46 -08:00
0e29e9e0f6 Re-enable internal test runs
Summary:
Fix internal error message due to old version of hypothesis
   test_suite = self.load_tests()
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dev/gen/caffe2/test/quantization#binary,link-tree/__fb_test_main__.py", line 678, in load_tests
    suite = loader.load_all()
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dev/gen/caffe2/test/quantization#binary,link-tree/__fb_test_main__.py", line 467, in load_all
    __import__(module_name, level=0)
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dev/gen/caffe2/test/quantization#binary,link-tree/test_quantization.py", line 45, in <module>
    hu.assert_deadline_disabled()
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dev/gen/caffe2/test/quantization#binary,link-tree/torch/testing/_internal/hypothesis_utils.py", line 322, in assert_deadline_disabled
    assert settings().deadline is None
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dev/gen/caffe2/test/quantization#binary,link-tree/hypothesis/_settings.py", line 127, in __getattr__
    raise AttributeError('settings has no attribute %s' % (name,))
AttributeError: settings has no attribute deadline

Test Plan: buck test mode/dev //caffe2/test:quantization -- --run-disabled runs successfully

Differential Revision: D19795232

fbshipit-source-id: ef1d8be20b4be30e1cfad4cd5019c4779a5f4568
2020-02-07 18:08:18 -08:00
17d4ef9e9e Support using scalar tensor for split (#32493)
Summary:
split requires an int input; however, under tracing, operators such as
size(axis) return a tensor, which differs from the eager (non-traced)
behavior. As such, split needs to be modified to handle these cases.
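A sketch of the scenario (hypothetical function; under tracing, `x.size(0)` yields a Tensor rather than a Python int, which `split` must now accept):

```python
import torch

def f(x):
    half = x.size(0) // 2        # a Tensor under tracing, an int in eager mode
    a, b = torch.split(x, half)  # previously failed when `half` was a Tensor
    return a + b

traced = torch.jit.trace(f, torch.randn(4, 3))
```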

Fixes https://github.com/pytorch/pytorch/issues/27551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32493

Reviewed By: hl475

Differential Revision: D19538254

Pulled By: houseroad

fbshipit-source-id: c8623009de5926aa38685e08121f4b48604bd8c0
2020-02-07 17:16:43 -08:00
7314f1c281 [torch/multiprocessing] Update documentation indicating that start_method is ignored for mp.spawn() (#33070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33070

The `start_method` parameter is intentionally ignored by `mp.spawn()`. Document this fact and point the user to `start_processes` if they want to use a different `start_method`.
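A sketch of the suggested alternative (assuming the `torch.multiprocessing.start_processes` API, which accepts a `start_method` argument):

```python
import torch.multiprocessing as mp

def worker(rank):
    print(f"worker {rank} started")

if __name__ == "__main__":
    # mp.spawn() always uses start_method='spawn'; for 'fork' use start_processes:
    mp.start_processes(worker, nprocs=2, start_method="fork")
```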

Test Plan:
Warning message looks like:
```
main.py:8: UserWarning: This method only supports start_method=spawn (got: fork).
To use a different start_method use:
         torch.multiprocessing.start_process(...)
  warnings.warn(msg)
```

Reviewed By: ailzhang

Differential Revision: D19780235

fbshipit-source-id: 4599cd18c3ba6cc401810efe4f390290ffa8023b
2020-02-07 15:26:00 -08:00
c6fa6d82ae move Decompose before profiling to prevent clearing shape info
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33100

Differential Revision: D19793346

Pulled By: Krovatkin

fbshipit-source-id: fdc5927f4970eabbb5a8f62a499d5b79117af2a9
2020-02-07 14:04:40 -08:00
868db903ae ONNX support for torch.take (#33061)
Summary:
Adding ONNX export support for torch.take()
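For reference, the op being exported: `torch.take` indexes the input as if it were flattened to 1-D:

```python
import torch

x = torch.tensor([[4, 3], [2, 1]])
idx = torch.tensor([0, 3])
print(torch.take(x, idx))   # tensor([4, 1]), flat (row-major) indexing
```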
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33061

Reviewed By: hl475

Differential Revision: D19782651

Pulled By: houseroad

fbshipit-source-id: 0168fb941e166acda4ca607165248b8e0b260ace
2020-02-07 13:41:26 -08:00
a9583c1f75 Vectorize softplus and its backward function on CPU (#32944)
Summary:
The benchmarking shows a huge performance gain (2-7x faster).

Also note that I removed Half support because it isn't generally supported on CPU.

Benchmark: (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz)

```python
import timeit
for op in ('Softplus',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(10_000, 10000),
                    (100_000, 1000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a)', setup=f'import torch; m = torch.nn.{op}(); a = torch.randn({n}, dtype={dtype})', number=t))
    print('Backward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(10_000, 40000),
                    (100_000, 4000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('y.backward(retain_graph=True)',
                                setup=f'import torch; m = torch.nn.{op}(); a = torch.randn({n}, dtype={dtype}, requires_grad=True); x = m(a); y = x.sum()',
                                number=t))
```

Before:

```
Forward
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.double
3.73130346799735
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.double
3.6790116359916283
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.float
2.7477027159911813
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.float
2.7382752639969112
Backward
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.double
7.037510035006562
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.double
5.855093962003593
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.float
3.413616877005552
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.float
2.5485514330066508
```

After:

```
Forward
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.double
0.9465823079954134
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.double
0.8799468770012027
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.float
0.39715987400268205
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.float
0.3563060039887205
Backward
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.double
2.400547721001203
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.double
1.4740848699875642
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.float
1.6684603010071442
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.float
0.6815649690106511
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32944

Differential Revision: D19725407

Pulled By: VitalyFedyunin

fbshipit-source-id: 7430de838df731bd17617eff63f10107d5ad6b8b
2020-02-07 11:28:49 -08:00
e7b42209eb Added sparkspot model.
Summary: The lite interpreter does not have the softplus and sub ops needed for this model.

Test Plan:
buck run fbsource//xplat/aibench:run_bench -- -b ../xplat/aibench/specifications/models/pytorch/mobile_migration/sparkspot.json --platform android --framework pytorch --remote --devices SM-G960U-8.0.0-26

 https://our.intern.facebook.com/intern/aibench/details/890521439770638

buck run fbsource//xplat/aibench:run_bench -- -b ../xplat/aibench/specifications/models/pytorch/mobile_migration/sparkspot.json --platform android/arm64 --framework pytorch --remote --devices SM-G960U-8.0.0-26

https://our.intern.facebook.com/intern/aibench/details/485779747361527

For Caffe2:
buck run fbsource//xplat/aibench:run_bench -- -b ../xplat/aibench/specifications/models/caffe2/mobile_migration/sparkspot.json --platform android --framework caffe2 --remote --devices SM-G950U-7.0-24

https://our.intern.facebook.com/intern/aibench/details/177482569133423

Reviewed By: ljk53, iseeyuan

Differential Revision: D19757721

fbshipit-source-id: cdd4b39d072925fc8de17184f2c90918de6245ba
2020-02-07 11:22:06 -08:00
de27f4261d [jit] remove redundant variables from JIT TestCase
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29091

Differential Revision: D19746083

Pulled By: suo

fbshipit-source-id: 76fd71740fe7a3f52da361d96a7b694ec208de24
2020-02-07 10:42:33 -08:00
d678093907 [ONNX] Extend op registration to next opsets (#32943)
Summary:
Currently, custom ops are registered for a specific opset version.
For example, all torchvision custom ops are registered for opset 11, and cannot be exported into higher opset versions. This PR extends op registration to higher opset versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32943

Reviewed By: hl475

Differential Revision: D19739406

Pulled By: houseroad

fbshipit-source-id: dd8b616de3a69a529d135fdd02608a17a8e421bc
2020-02-07 10:37:50 -08:00
3b2f267ad8 add to codeowner to get better inbox notification for PR
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33087

Differential Revision: D19790389

Pulled By: albanD

fbshipit-source-id: 360ee1fc47a9b0b8d8ddbe47b77f2cbffaead9c8
2020-02-07 07:56:47 -08:00
674dca0831 Automatic update of fbcode/onnx to 8b3f7e2e7a0f2aba0e629e23d89f07c7fc0e6a5e (#33075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33075

Previous import was 65020daafa9183c769938b4512ce543fd5740f8f

Included changes:
- **[8b3f7e2e](https://github.com/onnx/onnx/commit/8b3f7e2e)**: Update Dropout and  BatchNorm to be Training Friendly (#2568) <Lara Haidar>
- **[61f0bbc5](https://github.com/onnx/onnx/commit/61f0bbc5)**: Fix a bug in ScatterND shape inference (#2577) <Bowen Bao>
- **[05bce9cf](https://github.com/onnx/onnx/commit/05bce9cf)**: add utility function to make reference attribute whose name is not the same as the attribute it refers. (#2583) <Ke Zhang>
- **[71181c83](https://github.com/onnx/onnx/commit/71181c83)**: Clarify spec for constant of shape with dim_n = 0 (#2567) <Negin Raoof>
- **[eadba733](https://github.com/onnx/onnx/commit/eadba733)**: Update sigs.md with link to calendar page (#2579) <Prasanth Pulavarthi>
- **[08562f8e](https://github.com/onnx/onnx/commit/08562f8e)**: Update working-groups.md (#2580) <Prasanth Pulavarthi>
- **[0e718913](https://github.com/onnx/onnx/commit/0e718913)**: Fix Slice op's shape inference logic (#2526) <Hariharan Seshadri>
- **[12111410](https://github.com/onnx/onnx/commit/12111410)**: Add missing spaces to Random*Like doc (#2572) <Takeshi Watanabe>
- **[7e6e61d6](https://github.com/onnx/onnx/commit/7e6e61d6)**: Contributing: fix typos (#2571) <Maher Jendoubi>
- **[bbd604ef](https://github.com/onnx/onnx/commit/bbd604ef)**: Add Einsum op (#2504) <Negin Raoof>
- **[fd3ab73a](https://github.com/onnx/onnx/commit/fd3ab73a)**: Clarify split supports zero length splits (#2544) <Negin Raoof>
- **[6dd73774](https://github.com/onnx/onnx/commit/6dd73774)**: Fix circleci build and drop unsupported Windows builds (#2565) <Wei-Sheng Chin>
- **[b3d201a2](https://github.com/onnx/onnx/commit/b3d201a2)**: Fix the formula of intermediate zero calculation for DynamicQuantizeLinear (#2556) <Yufeng Li>
- **[3613eb25](https://github.com/onnx/onnx/commit/3613eb25)**: Add wording to clarify. (#2555) <Dwayne Robinson>
- **[dfa4384c](https://github.com/onnx/onnx/commit/dfa4384c)**: Fix shape inference for Split with split attribute (#2328) <Shinichiro Hamaji>
- **[684fc1bc](https://github.com/onnx/onnx/commit/684fc1bc)**: Keep symbolic dims in Concat with a single input (#2418) <Shinichiro Hamaji>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D19784487

fbshipit-source-id: 421cdc3394faeff0168853f4ff065fc599ca3967
2020-02-07 02:18:57 -08:00
e025f393f6 windows template specialization bug (#33076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33076

attempt at fixing https://github.com/pytorch/pytorch/issues/30886

Test Plan: circleCI with `call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64 -vcvars_ver=14.16` passes

Differential Revision: D19784550

fbshipit-source-id: 9fb42c3854d1d00d96cd7179bef9dd1aa2972ea6
2020-02-07 00:41:22 -08:00
05d18ffaf5 Distributed Autograd: Allow multiple backward passes to accumulate gradients. (#32506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32506

In this PR, we've introduced a `retain_graph` parameter to distributed
autograd similar to `torch.autograd.backward`.

In terms of design, this parameter is sent over RPC to all nodes and is used to
create the GraphTask on the local nodes. This enables us to run
`dist_autograd.backward()` multiple times in the same context.

The use case currently for this is to benchmark only the backward pass for
distributed autograd. We'd like to measure the QPS for the backward pass and as
a result, running a single forward pass and multiple backward passes in a loop
is one way to benchmark backward pass performance.
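
For reference, a minimal sketch of that benchmarking pattern, assuming the documented dist_autograd API and a hypothetical `model`/`inputs`:

```python
import torch.distributed.autograd as dist_autograd

with dist_autograd.context() as context_id:
    loss = model(inputs).sum()   # single forward pass (model/inputs are placeholders)
    for _ in range(100):         # many backward passes over the same graph
        dist_autograd.backward(context_id, [loss], retain_graph=True)
```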
ghstack-source-id: 97868900

Test Plan: waitforbuildbot

Differential Revision: D19521288

fbshipit-source-id: 7ad8521059fd400d7b5a6ab77ce56e1927ced90a
2020-02-06 23:27:21 -08:00
f0d7bd41b9 [jit] Minor: avoid recalculating some keys for map accesses in pickler. (#33060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33060

Noticed this when tracking down a partially-related SIGSEGV.
If inserting a non-present key into a memoized map, don't re-calculate it twice
(probably safer that way anyway).
ghstack-source-id: 97904485

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D19778008

fbshipit-source-id: 95b1d708c034a54b96a22ccbdffb24f72d08dffd
2020-02-06 21:25:04 -08:00
10db323b75 Updating submodules
Summary:
GitHub commits:

4121390031
fdd24faa6c
94471e632b
0a24425afd
8b79c69b6c
99f3917826
3853cef0ba
5db0cb90fc
714edbb20f
880ade1420

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: a63558a8df40c936d8959287f815835502b6cbd9
2020-02-06 21:01:50 -08:00
afa8cbf8c2 Modified randNLike for scripting (#32830)
Summary:
The randNLike function had required args that were not being used.
The method signature was therefore modified to give them default values,
so that no error is thrown when scripting omits these unused arguments.

Additionally, the const checker was modified to handle prim::Constant as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32830

Reviewed By: hl475

Differential Revision: D19731715

Pulled By: houseroad

fbshipit-source-id: a3cacb3977eecb88b122e0ceb654fdbf1c8286c1
2020-02-06 18:19:42 -08:00
432858c960 [ONNX] Fix exporting copy_ with index as tensor input (#32801)
Summary:
Supporting the case below. Previously, the index for copy_ was only considered as a constant integer, whereas it could be a tensor input as well.

```python
class InPlaceIndexedAssignment(torch.nn.Module):
    def forward(self, data, index, new_data):
        data[index] = new_data
        return data

data = torch.zeros(3, 4)
index = torch.tensor(1)
new_data = torch.arange(4).to(torch.float32)
torch.onnx.export(InPlaceIndexedAssignment(), (data, index, new_data), 'inplace_assign.onnx', opset_version=11)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32801

Reviewed By: hl475

Differential Revision: D19731666

Pulled By: houseroad

fbshipit-source-id: 08703fdccd817f901282e19847e259d93929e702
2020-02-06 18:11:47 -08:00
ca33aeba09 [JIT] Add Exit Transform / Convert To SSA to docs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24114

Differential Revision: D19780828

Pulled By: eellison

fbshipit-source-id: d481ad886b2ad6349a1646672e507336d45759fb
2020-02-06 18:04:06 -08:00
b0476dc6e6 Fix Typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33038

Differential Revision: D19769127

Pulled By: zou3519

fbshipit-source-id: 53a7fa603b097d7070ca484997a587ec74e87357
2020-02-06 11:16:56 -08:00
38820a7014 [JIT] Resolve custom classes in source importer (#32977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32977
ghstack-source-id: 97736042

Test Plan: Imported from OSS

Differential Revision: D19724588

fbshipit-source-id: b31b6ae14d2881d3604922e611fe4749108e674d
2020-02-06 10:45:40 -08:00
757cea92a4 [c10] Allow taking a std::tuple as arg (#32948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32948
ghstack-source-id: 97736044

Test Plan: Imported from OSS

Differential Revision: D19709119

fbshipit-source-id: 26b069a95ae7a79a2d5cbe3845eb1a5dcd398be1
2020-02-06 10:44:31 -08:00
8195961f20 Revert D19730209: [pytorch][PR] Issue a warning when using zero_grad in DataParallel
Test Plan: revert-hammer

Differential Revision:
D19730209

Original commit changeset: cb9b2cb0c2e0

fbshipit-source-id: 5bf53ea3c37a7ed2411a2acc34e40d07eff144c9
2020-02-06 07:05:51 -08:00
ec1e9a1ae2 Revert D19417087: fix #30480 torch.normal shape checking is broken
Test Plan: revert-hammer

Differential Revision:
D19417087

Original commit changeset: 1c4bc7df9231

fbshipit-source-id: ee579304cd79e48a6ce87daf490b53baabc655a8
2020-02-06 07:01:29 -08:00
e76fa9822d [C2] Introduce extra_info force CPU tags for auto-generated iteration counter blobs (#32607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32607

As desc.

Test Plan: Unit-test.

Reviewed By: xw285cornell, chocjy

Differential Revision: D19551567

fbshipit-source-id: 3a121351d2b4016e99a1536dec746be970698664
2020-02-05 23:49:27 -08:00
3c17cbb6c8 fix #30480 torch.normal shape checking is broken (#32243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32243

Following what gchanan proposed in #30480
- If the (logical) shapes of mean and std are broadcastable, we broadcast them for the output
  Done in tensor iterator already.
- If the (logical) shapes of mean and std are not broadcastable and they have the same number of elements, we fall back to the old behavior (pick the shape of mean)
  Done by reshape std to the same shape of mean.
- If the (logical) shapes of mean and std are not broadcastable and don't have the same number of elements, we error out.
  Done by the tensor iterator already.
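
A short sketch of the three cases as described above (note that a revert of this diff appears earlier in this log):

```python
import torch

# Broadcastable shapes: output takes the broadcast shape.
torch.normal(torch.zeros(2, 3), torch.ones(3)).shape   # torch.Size([2, 3])

# Not broadcastable, same numel: falls back to the shape of mean.
torch.normal(torch.zeros(6), torch.ones(2, 3)).shape   # torch.Size([6])

# Not broadcastable, different numel: errors out in the tensor iterator.
# torch.normal(torch.zeros(2), torch.ones(3))          # RuntimeError
```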

Test Plan: Imported from OSS

Differential Revision: D19417087

Pulled By: glaringlee

fbshipit-source-id: 1c4bc7df923110a803620b9e2abd11a7151fc33e
2020-02-05 23:47:14 -08:00
b00345a6f2 Move normal distribution to Aten(CPU) (#32031)
Summary:
Fix https://github.com/pytorch/pytorch/issues/24746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32031

Differential Revision: D19729002

Pulled By: ezyang

fbshipit-source-id: f571368a8a2ac4068c937062167a2fd89e64098c
2020-02-05 20:39:40 -08:00
46c3c18bcc Issue a warning when using zero_grad in DataParallel (#32870)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31768

`DataParallel` creates replicas of the original `nn.Module` with the parameters duplicated onto the destination devices. Calling `backward` will propagate gradients onto the original module parameters, but calling `zero_grad` on the replica module doesn't clear the gradients from the parent module,

~breaking any model that uses `backward`-`zero_grad` in its `forward`. I fix this by patching the replica module so that `zero_grad` clears grads on the parent as well.~

However, any replica using backwards was broken anyway since the replica's parameters are not leaf nodes in autograd. So, we should raise a warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32870

Differential Revision: D19730209

Pulled By: ezyang

fbshipit-source-id: cb9b2cb0c2e0aca688ce0ff3e56b40fbd2aa3c66
2020-02-05 20:25:04 -08:00
6209412647 Add option to use ninja to compile ahead-of-time cpp_extensions (#32495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32495

Background
------------------------------
Previously, ninja was used to compile+link inline cpp_extensions and
ahead-of-time cpp_extensions were compiled with distutils. This PR adds
the ability to compile (but not link) ahead-of-time cpp_extensions with ninja.

The main motivation for this is to speed up cpp_extension builds: distutils
does not make use of parallelism. With this PR, using the new option, on my machine,
- torchvision compilation goes from 3m43s to 49s
- nestedtensor compilation goes from 2m0s to 28s.

User-facing changes
------------------------------

I added a `use_ninja` flag to BuildExtension (see the sketch after this
list). This defaults to `True`. When `use_ninja` is True:
- it will attempt to use ninja.
- If we cannot use ninja, then this throws a warning and falls back to
distutils.
- Situations we cannot use ninja: Windows (NYI, I'll open a new issue
for this), if ninja cannot be found on the system.
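
A hedged sketch of a setup.py using the flag, assuming the `with_options` helper on BuildExtension; the extension name and source file are hypothetical:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name='my_ext',  # hypothetical extension
    ext_modules=[CppExtension('my_ext', ['my_ext.cpp'])],
    # use_ninja defaults to True; pass False to force the distutils path.
    cmdclass={'build_ext': BuildExtension.with_options(use_ninja=True)},
)
```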

Implementation Details
------------------------------

This PR makes this change in two steps. Please let me know if it would be
easier to review if I split this up into a stacked diff.
Those changes are:
1) refactor _write_ninja_file to separate the policy (what compiler flags
to pass) from the mechanism (how to write the ninja file and do compilation).
2) call _write_ninja_file and _run_ninja_build while building
ahead-of-time cpp_extensions. These are only used to compile objects;
distutils still handles the linking.

Change 1: refactor _write_ninja_file to separate policy from mechanism
- I split _write_ninja_file into: _write_ninja_file and
_write_ninja_file_to_build_library
- I renamed _build_extension_module to _run_ninja_build

Change 2: Call _write_ninja_file while building ahead-of-time
cpp_extensions
- _write_ninja_file_and_compile_objects calls _write_ninja_file to only
build object files.
- We monkey-patch distutils.CCompiler.compile to call
_write_ninja_files_and_compile_objects
- distutils still handles the linking step. The linking step is not a
bottleneck so it was not a concern.
- This change only works on unix-based systems. Our code for windows
goes down a different codepath and I did not want to mess with that.
- If a system does not support ninja, we raise a warning and fall back
to the original compilation path.

Test Plan
------------------------------

Adhoc testing
- I built torchvision using pytorch master and printed out the build
commands. Next, I used this branch to build torchvision and looked at
the ninja file. I compared the ninja file with the build commands and
asserted that they were functionally the same.
- I repeated the above for pytorch/nestedtensor.

PyTorch test suite
- I split `test_cpp_extensions` into `test_cpp_extensions_aot` and
`test_cpp_extensions_jit`. The AOT (ahead-of-time) version tests
ahead-of-time and the JIT version tests just-in-time (not to be confused
with TorchScript)
- `test_cpp_extensions_aot` gets run TWICE by run_test.py, once with
a module that was built with ninja, and once with a module that was
built without ninja.
- run_test.py asserts that when we are building with use_ninja=True,
ninja is actually available on the system.

Test Plan: Imported from OSS

Differential Revision: D19730432

Pulled By: zou3519

fbshipit-source-id: 819590d01cf65e8da5a1e8019b8b3084792fee90
2020-02-05 18:49:29 -08:00
e54d954572 [ONNX] Add flag to enable script tests (#32654)
Summary:
This will allow us to incrementally enable more tests for scripting as we put in fixes. houseroad spandantiwari
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32654

Reviewed By: hl475

Differential Revision: D19583401

Pulled By: houseroad

fbshipit-source-id: 8dc05e4784df819c939dffdf33b00cbb80bfa364
2020-02-05 17:51:00 -08:00
1b746b95fb Consider hub_dir alongside TORCH_HOME env variable for storing hub models (#32844)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31944
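
A quick sketch of the related user-facing control (the path is hypothetical):

```python
import torch

# torch.hub.set_dir takes precedence over the TORCH_HOME env variable.
torch.hub.set_dir('/tmp/my_hub_cache')
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
```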
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32844

Differential Revision: D19747566

Pulled By: ailzhang

fbshipit-source-id: caca41a3a057d7d280d4783515aba2cc48c82012
2020-02-05 15:35:53 -08:00
74ce3a032c Fix some bugs with zipfile serialization (#32244)
Summary:
Stacked PRs
 * #32958 - Make zip serialization the default
 * **#32244 - Fix some bugs with zipfile serialization**

It includes the following changes:
* Split up tests so that we can test both serialization methods
    * Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
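A small sketch of opting into the zipfile-based format at this point (the follow-up PR above makes it the default):

```python
import torch

t = torch.randn(3)
torch.save(t, 'tensor.pt', _use_new_zipfile_serialization=True)
assert torch.equal(torch.load('tensor.pt'), t)
```
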
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32244

Pulled By: driazati

Reviewed By: eellison

Differential Revision: D19418935

fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573
2020-02-05 15:32:14 -08:00
ab75d64e6e Add ability to abort NCCL communicators from the store. (#32895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32895

When a particular rank calls `ncclCommAbort` on a communicator, it is
important to ensure all other ranks call `ncclCommAbort` on their respective
communicators. If this is not done, the other ranks could get stuck causing the
GPU to spin with 100% utilization.

To alleviate this issue, whenever any rank calls `ncclCommAbort` we put the
unique communicator id in the store. The NCCL watchdog thread then monitors the
store and aborts any communicators found in the store as "aborted".

A few more general fixes in this PR:

1) Use std::shared_ptr for the store in PrefixStore. PrefixStore was using a
reference to the store and when that reference went out of scope the store
object it was holding onto was invalid. This caused a segfault in the watchdog
thread.
2) Enhanced logging for the watchdog thread.

Test Plan: waitforbuildbot

Differential Revision: D19638159

fbshipit-source-id: 596cd87c9fe6d4aeaaab4cb7319cc37784d06eaa
2020-02-05 15:28:05 -08:00
df1d68d52e [jit] fix parser for one-line functions (#32941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32941

The Python grammar allows single-statement one-line functions. So we
should allow it in the string parser.

Test Plan: Imported from OSS

Differential Revision: D19704153

Pulled By: suo

fbshipit-source-id: 8c06cc9c600aa2a9567b484a1ecc0360aad443e3
2020-02-05 13:11:47 -08:00
908b451efb Enabling the nccl/rccl test for ROCM environment (#32340)
Summary:
Enables the RCCL test on ROCm by adding a temporary grace period for cleanup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32340

Differential Revision: D19744459

Pulled By: xw285cornell

fbshipit-source-id: 1af3b64113a67f93e622d010ddd3020e5d6c8bc8
2020-02-05 12:02:31 -08:00
e8581869f2 Properly update _flat_weights in RNN models (#32989)
Summary:
Resubmitting https://github.com/pytorch/pytorch/issues/32939
Should hopefully fix https://github.com/pytorch/pytorch/issues/32346. Now, when the _flat_weights list is updated, None elements are appended to it if some weights are missing; subsequent setattr calls for the missing weights then repair _flat_weights and make it suitable for use in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32989

Differential Revision: D19731952

Pulled By: ngimel

fbshipit-source-id: 2118a19840491e7ab0fef15185fad982f42795a6
2020-02-05 11:53:41 -08:00
72b9412be2 Move some broadcasting logic away from codegen. (#32982)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32982

For masked_scatter_ and masked_fill_ (which already have manually written wrappers), move the broadcasting logic into the manually written wrappers.

Test Plan: Imported from OSS

Differential Revision: D19726830

Pulled By: gchanan

fbshipit-source-id: 1f6e55e19c1314a76e43946b14d58f147c0f8204
2020-02-05 10:23:49 -08:00
fbde3c05b6 [aten] fix vector memory leak (#32478)
Summary:
free(y) missing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32478

Differential Revision: D19728471

Pulled By: agolynski

fbshipit-source-id: 73e7933c832f9c19f3fe09df76699c7b335a87bd
2020-02-05 10:18:54 -08:00
81a9046301 Fix dispatch of argmax/argmin. (#32961)
Summary:
The way we currently dispatch argmax/argmin to out-of-source devices is problematic and has caused issues; e.g., it doesn't work well when the input requires grad (https://github.com/pytorch/xla/issues/1585).
Making argmax/argmin dispatch at the device level resolves this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32961

Differential Revision: D19726826

Pulled By: ailzhang

fbshipit-source-id: f7fb445fd8e7691524afcc47d24d8e6b0171d10c
2020-02-05 10:17:50 -08:00
3531f99384 Kill _th_max, _th_min overloads that aren't used.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32981

Test Plan: Imported from OSS

Differential Revision: D19726831

Pulled By: gchanan

fbshipit-source-id: 22b5b9115838360850c4ee250ed95742f3444dc8
2020-02-05 09:20:21 -08:00
16c166e2ea Add XLAPreAutograd key for XLA use cases that need custom autograd. (#32788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32788

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19628643

Pulled By: ezyang

fbshipit-source-id: 7099b08eff37913144b961dda00b070bd4b939d4
2020-02-05 08:10:02 -08:00
6b0813ea5d Stop using dispatchTypeId to do checks for tensor list unwrap. (#32787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32787

Gets rid of a longstanding TODO.  TensorList unwrap is only used for cat, which
means we can assume that the inputs are dense, and do something similar to how
we do the dense tensor wrapping above.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19628642

Pulled By: ezyang

fbshipit-source-id: 3264439407585fb97995a9a2302c2913efecb421
2020-02-05 08:08:16 -08:00
1b446aa2ee Expose Channel Last 3d enum
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32947

Test Plan: Imported from OSS

Differential Revision: D19707716

Pulled By: glaringlee

fbshipit-source-id: 03824769376043bc6151a4580aba27654de5077f
2020-02-04 23:33:19 -08:00
836b4c9e64 Attempt to workaround MSVC17 static constexpr bug
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33002

Test Plan: Imported from OSS

Differential Revision: D19739097

Pulled By: jamesr66a

fbshipit-source-id: 7ce54ddb1f56a741d88d3215b154192171c54dfa
2020-02-04 22:33:22 -08:00
f393adc0ed [JIT] Fix python pickle serialization for torchbind (#32878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32878
ghstack-source-id: 97736045

Test Plan: Imported from OSS

Differential Revision: D19669879

fbshipit-source-id: 23ea91cffe7344d1eed014e2509983c281dd18d3
2020-02-04 19:29:55 -08:00
23a4800708 [JIT] Make IRParser use op schema (#32854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32854
ghstack-source-id: 97736043

Test Plan: Imported from OSS

Differential Revision: D19656881

fbshipit-source-id: 509d09fdbd765ca5cd153bec6440aedfb4e6d23b
2020-02-04 19:29:50 -08:00
bc4790b3aa [JIT] Trace uses of torchbind classes as module attributes (#32833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32833
ghstack-source-id: 97736046

Test Plan: Imported from OSS

Differential Revision: D19645714

fbshipit-source-id: 10a7271f13c3588aea666b44b916e90ba7b3c666
2020-02-04 19:28:37 -08:00
d141465713 Fix torch::allclose to handle std::numeric_limits<T>::lowest() for integral types (#32978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32978

Fixes #32946

Test Plan: Imported from OSS

Differential Revision: D19726013

Pulled By: pbelevich

fbshipit-source-id: ada4aeabc8e39016d24f1a40f02fb7c56f069cd3
2020-02-04 19:06:52 -08:00
e4f633ba0b Updating submodules
Summary:
GitHub commits:

619d2503cb
c442208177
75d9b18eba
ed5142083a

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 11a53fea064f8e40c2a89d3068421d7cad231d00
2020-02-04 16:36:24 -08:00
4502d8c391 Interpolate Float [] support in ONNX (#32554)
Summary:
The PR https://github.com/pytorch/pytorch/pull/31791 adds support for float[] constants, which affects some cases of ONNX interpolate support.
This PR adds float[] constant support in ONNX, updates interpolate in ONNX, and re-enables the disabled tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32554

Reviewed By: hl475

Differential Revision: D19566596

Pulled By: houseroad

fbshipit-source-id: 843f62c86126fdf4f9c0117b65965682a776e7e9
2020-02-04 16:14:40 -08:00
bda874b480 [rpc] throw correct Exception on local client based on the RemoteException (#32936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32936

Closes https://github.com/pytorch/pytorch/issues/32732. Currently if a
UDF run in RPC throws an exception such as ValueError or TypeError, we wrap
this in a RemoteException on the callee side. When raising this on the caller
side, we currently raise a vanilla Exception. This diff changes it so that the
correct exception is thrown. Tested by changing the current rpc tests to assert
on the right type of error rather than just the base `Exception`.
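
A hedged sketch of the new behavior; the worker name is hypothetical:

```python
import torch.distributed.rpc as rpc

def faulty(x):
    raise ValueError(f"bad input: {x}")

try:
    rpc.rpc_sync("worker1", faulty, args=(42,))
except ValueError as e:   # previously only the base Exception was raised
    print("caught:", e)
```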
ghstack-source-id: 97706957

Test Plan: Modified unit test.

Differential Revision: D19700434

fbshipit-source-id: e451b772ea6aecc1d2e109e67e7f932eb9151f15
2020-02-04 16:08:25 -08:00
a9141dd240 Patch Half.h for compiling CUDA with clang (#29027)
Summary:
Following discussion: https://github.com/pytorch/pytorch/issues/28417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29027

Differential Revision: D19698745

Pulled By: ezyang

fbshipit-source-id: fab4be3bcbac8f3b334d7e0a56e6a790e2c6b6d8
2020-02-04 15:05:52 -08:00
7ea6559658 Add size checks to torch.stack (#32931)
Summary:
Checks the size of each tensor passed to `torch.stack` before calling `cat` to address https://github.com/pytorch/pytorch/issues/29510. This is done in the `get_stack_input` function as that is a common path. The function now compares the size of each tensor in the TensorList to the size of the first tensor and throws an exception when the sizes are not equal.

To compare:
```
x = torch.zeros([1, 2])
y = torch.zeros([1, 3])
torch.stack([x, y]) # Errors due to size differences
```
Current error:
```
RuntimeError: invalid argument 0: Sizes of tensors must match
except in dimension 0. Got 2 and 3 in dimension 2 at (path)\aten\src\TH/generic/THTensor.cpp:612
```
New error:
```
RuntimeError: stack expects each tensor to be equal size, but
got [1, 2] at entry 0 and [1, 3] at entry 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32931

Differential Revision: D19700110

Pulled By: ezyang

fbshipit-source-id: 7e18bb00fa2c137e418e340d719b6b76170b83e3
2020-02-04 15:00:54 -08:00
58e8d5588a [ONNX] Export bitwise_not for bool (logical_not) (#28439)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25805 (for bool tensors as in the issue)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28439

Differential Revision: D19700156

Pulled By: ezyang

fbshipit-source-id: 0706ada6a8d259dce381ba2d009f226e14c3c14f
2020-02-04 14:45:58 -08:00
4f5908d5d7 Remove unneded TORCH_API (#32015)
Summary:
It was causing a build error when compiling on MINGW64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32015

Differential Revision: D19697296

Pulled By: ezyang

fbshipit-source-id: 71e58783c48f8e99755c091b2027d59740dfca47
2020-02-04 14:44:35 -08:00
6305e4a88f Add warning and example for seeding to DistributedSampler (#32951)
Summary:
Closes gh-31771

Also note that the `epoch` attribute is *only* used as a manual seed in each iteration (so it could easily be changed/renamed). Seeding consecutive iterations with `[0, 1, 2, ...]` is low-entropy; however, in practice it probably doesn't matter when using the sampler in combination with a dataloader (because there won't be enough data or epochs to run into statistical issues due to low-entropy seeding). So leaving that as is.
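
For reference, the documented per-epoch reseeding pattern looks roughly like this (dataset is assumed to exist):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reseeds the shuffling for this epoch
    for batch in loader:
        ...
```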

Rendered docstring:

<img width="534" alt="image" src="https://user-images.githubusercontent.com/98330/73701250-35134100-46e9-11ea-97b8-3baeb60fcb37.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32951

Differential Revision: D19729333

Pulled By: ezyang

fbshipit-source-id: 3ddf90a3828b8bbae88aa2195a5d0b7d8ee1b066
2020-02-04 14:36:59 -08:00
b0d5ce3848 Revert D19710990: [pytorch][PR] properly update _flat_weights in RNN modules
Test Plan: revert-hammer

Differential Revision:
D19710990

Original commit changeset: c978c7519464

fbshipit-source-id: 8710bc2f4f1d01d9c93d038b59caf1e6859375dd
2020-02-04 14:35:55 -08:00
cyy
27e1fecabd let user specify CUDA_HOST_COMPILER
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32904

Differential Revision: D19729047

Pulled By: ezyang

fbshipit-source-id: c233e3924f71a025c51d25a7e3a8d728dac8730a
2020-02-04 14:32:12 -08:00
d3a0bdd06b proofreading (#29797)
Summary:
two instances of if -> it in torch.nn.modules.batchnorm.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29797

Differential Revision: D19698613

Pulled By: ezyang

fbshipit-source-id: 7312b2333f227113e904dfa91db90d00e525affb
2020-02-04 14:30:36 -08:00
ea968f5cc3 fix possible pandas import error during tensorboard tests (#29650)
Summary:
TensorBoard tests using SummaryWriter() may fail with a pandas import
complaint if TensorFlow packages are installed in the same python
environment as PyTorch:

Traceback (most recent call last):
  File "test_tensorboard.py", line 212, in test_writer
    with self.createSummaryWriter() as writer:
  File "test_tensorboard.py", line 64, in createSummaryWriter
    return SummaryWriter(temp_dir)
...
  File "[...]/site-packages/pandas/core/arrays/categorical.py", line 52, in <module>
    import pandas.core.algorithms as algorithms
AttributeError: module 'pandas' has no attribute 'core'

The exact failure may depend on the pandas version. We've also seen:

  File "[...]/site-packages/pandas/core/arrays/categorical.py", line 9, in <module>
    import pandas.compat as compat
AttributeError: module 'pandas' has no attribute 'compat'

The module import chain leading to the failure is: tensorboard imports
tensorflow, which imports tensorflow_estimator, which imports pandas.
pandas includes a submodule named 'bottleneck', whose name collides with
the PyTorch 'test/bottleneck/' subdirectory.

So IF tensorboard, tensorflow, tensorflow_estimator, and pandas are
installed in the python environment AND IF testing is run from within
PyTorch's 'test/' directory (or maybe just with 'test/' in PYTHONPATH,
etc.), then TensorBoard tests using SummaryWriter() will fail.

Rename the 'bottleneck/' directory slightly to avoid the name collision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29650

Differential Revision: D19698638

Pulled By: ezyang

fbshipit-source-id: cb59342ed407cb37aefc833d67f768a8809129ac
2020-02-04 14:27:46 -08:00
478356aeec Fix broken links in governance.rst
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30815

Differential Revision: D19697401

Pulled By: ezyang

fbshipit-source-id: d7e1a1b54039624f471b6cfb568428feb73060f4
2020-02-04 14:26:09 -08:00
18d1896ba0 Fix confusing "does not have GPU support" warning message (#30721)
Summary:
Many people who use caffe2 are confused by the "does not have GPU support" warning message.
https://github.com/facebookresearch/video-nonlocal-net/issues/6
facebookarchive/caffe2#346
facebookarchive/caffe2#1634
facebookarchive/caffe2#197

Many non-GPU reasons can cause this warning message, so it is better to include the underlying error info.
![image](https://user-images.githubusercontent.com/13826327/70129721-41175e00-16ba-11ea-85df-a4b1a1690149.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30721

Differential Revision: D19697413

Pulled By: ezyang

fbshipit-source-id: bd24b7c814e7e677352068b9e9f77a68de080159
2020-02-04 14:20:00 -08:00
67706187fb Fix a broken link in contribution_guide.rst
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30814

Differential Revision: D19697403

Pulled By: ezyang

fbshipit-source-id: b01fd0e189b3bc7ccaa197c9c64e12fee70a6310
2020-02-04 14:14:25 -08:00
b69c685c4a try to find cudnn header in /usr/include/cuda (#31755)
Summary:
With the Fedora negativo17 repo, the cuDNN headers are installed in the /usr/include/cuda directory, alongside other CUDA libraries.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31755

Differential Revision: D19697262

Pulled By: ezyang

fbshipit-source-id: be80d3467ffb90fd677d551f4403aea65a2ef5b3
2020-02-04 14:10:32 -08:00
e999095594 Updating submodules
Summary:
GitHub commits:

8f3d7019bb
a5df50cf5c
b896a52075
3a073234da
7c05bee055
90f0aa9665
5cdd1abbb9

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 70dd062814f68bda77e119bb9deaefbf71c551e6
2020-02-04 13:00:26 -08:00
d3fa68eeec Fix for MKL detection script on Windows (#32970)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32914.
1. Use `DEFINED ENV{MKLProductDir}` instead of `$ENV{MKLProductDir}`
2. Cache `INTEL_COMPILER_DIR` and `INTEL_MKL_DIR`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32970

Differential Revision: D19727677

Pulled By: soumith

fbshipit-source-id: 065c6bee35a2295f1c478df1460cad7668b25af5
2020-02-04 12:41:39 -08:00
e922826dda [pytorch] simplify lazy initialization of DefaultCPUGenerator singleton (#32897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32897

Moving the default static instance into the method to achieve the same purpose.
ghstack-source-id: 97570792

Test Plan: - CI

Reviewed By: dreiss

Differential Revision: D19674566

fbshipit-source-id: 27f54da66dd7667c34905eddaac6579e64aa1118
2020-02-04 11:37:14 -08:00
aa3c871739 Adds TestViewOps, updates documentation (#32512)
Summary:
Understanding which ops return views and which return tensors with new storage is a common user issue, and an issue for developers connecting accelerators to PyTorch, too. This generic test suite verifies that ops which should return views do (and a few ops that shouldn't don't).  The documentation has also been updated for .t(), permute(), unfold(), and select() to clarify they return views.
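
A tiny sketch of the view semantics being documented (not part of the test suite itself):

```python
import torch

a = torch.zeros(2, 3)
v = a.t()              # .t() returns a view, not a copy
v[0, 0] = 100
assert a[0, 0] == 100  # writes through the view into the base tensor
assert v._base is a    # views track their base (internal attribute)
```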
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32512

Differential Revision: D19659454

Pulled By: mruberry

fbshipit-source-id: b4334be9b698253a979e1bb8746fdb3ca24aa4e3
2020-02-04 11:10:34 -08:00
341fb6d11d Make caffe2/caffe2/python/models/seq2seq python3 compatible
Test Plan: waitforsandcastle

Reviewed By: dzhulgakov

Differential Revision: D19698403

fbshipit-source-id: 36b73e07e598c848abbe368e522484da9ba4c78f
2020-02-04 10:51:47 -08:00
Jie
9e7c47644f [NHWC CUDNN CONV] Update cudnn convolution memory_format behavior (#32482)
Summary:
1. Allows both the memory_format of weight & input to dictate the output
memory_format.
2. Provides utility function to recursively convert memory_format of Conv2d and
ConvTranspose2d layers. This allows easy model conversion and ensures that lost
memory_format through incompatible layers could be restored at Convolution-like
layer, where significant performance boost is expected on later generation CUDA
devices.
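
A hedged sketch of the intended usage, assuming a CUDA device with cuDNN:

```python
import torch

model = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
model = model.to(memory_format=torch.channels_last)  # convert the weights

x = torch.randn(1, 3, 32, 32, device='cuda').to(memory_format=torch.channels_last)
y = model(x)
# With NHWC input/weight, the convolution produces NHWC output.
assert y.is_contiguous(memory_format=torch.channels_last)
```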
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32482

Differential Revision: D19647903

Pulled By: VitalyFedyunin

fbshipit-source-id: 62c96ff6208ff5e84fae1f55b63af9a010ad199a
2020-02-04 09:50:57 -08:00
ec2c974bd5 Simplify some TH codegen by moving code out of the switch and killing dead code. (#32888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32888

This kills ~1500 lines of generated code by doing the following:
1) Stop binding _th_clone, which isn't used anymore.

2) Move allocation code out of the switch, because it doesn't need to be there, example:
Now:
```
auto dispatch_scalar_type = infer_scalar_type(self);
auto result_ = c10::make_intrusive<TensorImpl, UndefinedTensorImpl>(c10::Storage(scalarTypeToTypeMeta(dispatch_scalar_type), 0, allocator(), true),DispatchKey::CPUTensorId).release();
auto result = Tensor(c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl>::reclaim(result_));
switch (dispatch_scalar_type) {
    case ScalarType::Bool: {
        ...
    case ScalarType::Byte: {
	    ...
```
Before:
```
auto dispatch_scalar_type = infer_scalar_type(self);
switch(dispatch_scalar_type) {
    case ScalarType::Bool: {
       	auto result_ = c10::make_intrusive<TensorImpl, UndefinedTensorImpl>(c10::Storage(caffe2::TypeMeta::Make<bool>(), 0, allocator(), true),DispatchKey::CPUTensorId).release();
        auto result = Tensor(c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl>::reclaim(result_));
    case ScalarType::Byte: {
        auto result_ = c10::make_intrusive<TensorImpl, UndefinedTensorImpl>(c10::Storage(caffe2::TypeMeta::Make<uint8_t>(), 0, allocator(), true),DispatchKey::CPUTensorId).release();
        auto result = Tensor(c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl>::reclaim(result_));
```

Note there's one extra lookup from ScalarType -> TypeMeta, but that can go away once we are able to put everything in a dispatch macro.

3) Prepare for more moves out of the switch by using dispatch_scalar_type where we would have used an explicit ScalarType::Name
More moves are currently blocked by "real" types needing to map scalar_type -> C++ type.  Dispatch macros can solve that, but I'll need to wrap the actual TH calls in templates so the entire
thing can be done via dispatch.

4) Kill some codegen that isn't used anymore: ALLOC_WRAP, is_actual_return_long.

Test Plan: Imported from OSS

Differential Revision: D19672613

Pulled By: gchanan

fbshipit-source-id: 753f480842d11757e10182e43b471bd3abaa5446
2020-02-04 08:41:20 -08:00
820410b505 Added upsample_nearest2d op for lite interpreter. (#32913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32913

This enables mobile detection and tracking models.

Test Plan: buck test caffe2/test/cpp/jit:jit -- JitTest.LiteInterpreterUpsampleNearest2d

Reviewed By: iseeyuan

Differential Revision: D19664502

fbshipit-source-id: 1c7270dcf394aba7b510c5aa80552c58a5038f24
2020-02-04 07:59:03 -08:00
b894dc06de [Pytorch] Propagate errors in clearAndWaitForOutstandingRpcsAsync. (#32952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32952

When the Async() version of clearAndWaitForOutstandingRpcs() was written,
we didn't yet have the generic Future<T> class, and hadn't worked out our
error model fully.

This change fixes that method to properly propagate the first encountered error
to the future, using a bool+CAS.
ghstack-source-id: 97665749

Test Plan: existing test coverage, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D19710337

fbshipit-source-id: 66ce5593a94a16ea624930dbb9409917ef5cfd5d
2020-02-03 20:47:51 -08:00
b4b1b100bd Add a loop test for onnxified net (#32935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32935

Mock away the content of onnxified net with some low cost ops so that we can still mimic the input/output transfer while doing minimal work on the card.

Test Plan:
```
buck run glow/fb/test:sparsenn_test -- --gtest_filter='SparseNNTest.vanillaC2' --onnxifi_debug_mode --onnxifi_loop_test_mode --nocaffe2_predictor_use_memonger
```

Differential Revision: D19631971

fbshipit-source-id: f970c55ccb410702f479255eeb750e01e3f8c2ae
2020-02-03 18:35:41 -08:00
df71b3e23a properly update _flat_weights in RNN modules (#32939)
Summary:
Should hopefully fix https://github.com/pytorch/pytorch/issues/32346. Now, when the _flat_weights list is updated, `None` elements are appended to it if some weights are missing; subsequent `setattr` calls for the missing weights then repair _flat_weights and make it suitable for use in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32939

Differential Revision: D19710990

Pulled By: ngimel

fbshipit-source-id: c978c7519464e94beeffa9bc33b9172854a2f298
2020-02-03 18:27:00 -08:00
3cac9900ca Clarify when softplus is reverted to linear. (#32945)
Summary:
The default value is removed because it is explained right below.
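
A small sketch of the reversion to linear that the docs now clarify:

```python
import torch

m = torch.nn.Softplus(beta=1, threshold=20)
x = torch.tensor([1.0, 30.0])
y = m(x)
# Where input * beta > threshold, the op returns the input directly
# (the linear regime), avoiding overflow in exp().
assert y[1] == x[1]
```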
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32945

Reviewed By: soumith

Differential Revision: D19706567

Pulled By: ailzhang

fbshipit-source-id: 1b7cc87991532f69b81aaae2451d944f70dda427
2020-02-03 17:54:31 -08:00
544eab37d0 Move deprecation warning out of generated code into python_arg_parser. (#32907)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32907

All op-specific information used in this logic was available to the
parser itself, so the check can be done in that context, no codegen
needed.

No change in the warning behavior itself, modulo a minor formatting tweak;
passes existing tests. Saves roughly 275K of binary size on Mac:
```
-rwxr-xr-x  1 bhosmer  1876110778   16502064 Feb  1 00:43 torch/lib/libtorch_python.dylib
-rwxr-xr-x  1 bhosmer  1876110778   16247888 Feb  1 00:44 torch/lib/libtorch_python.dylib
```

[codegen diff](https://github.com/bhosmer/scratch/compare/deprecation_warning_before...deprecation_warning_after)

More important than the size savings is the minimization of codegen. Ideally the generated artifact should express distinctive per-op properties in as minimal a form as practically possible - e.g. here instead of generating check-and-warn behavior into every binding, we generate only the data that triggers the behavior in the parser. (And actually we were generating it already.)

Test Plan: Imported from OSS

Differential Revision: D19679928

Pulled By: bhosmer

fbshipit-source-id: cf0140573118430720c6b797c762fe5be98acd86
2020-02-03 17:47:04 -08:00
612e621da0 Improve CHECK_OP macro (#29539)
Summary:
- Show values in question like glog.
- Handle expressions with logical operators properly by adding
  parentheses around expressions.
- Allow outputting nullptr (some build failed without this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29539

Reviewed By: dreiss

Differential Revision: D19698991

Pulled By: ljk53

fbshipit-source-id: e329c01622cfc386ac009904092519a4adfe94a8
2020-02-03 17:27:41 -08:00
5ca7bf453d Tests for verifying behaviour of BatchNorm using 0-dim batch sizes. (#32384)
Summary:
The `BatchNorm*` part of the issue (see gh-12013) seems to have been fixed in the master branch and these tests would make it concrete.

However, I would appreciate comments on https://github.com/pytorch/pytorch/issues/12013#issuecomment-575871264 regarding whether the current behaviour is satisfactory.
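
A minimal sketch of the behaviour under test, using eval mode for simplicity:

```python
import torch

bn = torch.nn.BatchNorm2d(3).eval()
x = torch.empty(0, 3, 4, 4)  # zero-size batch
y = bn(x)                    # runs without error
assert y.shape == x.shape
```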
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32384

Differential Revision: D19704154

Pulled By: ngimel

fbshipit-source-id: 1bbbbf1ae1215a460b22cf26e6b263e518ecf60b
2020-02-03 16:58:23 -08:00
9c2ed2574a Vectorized memory access in TensorIterator GPU loop for 1d contiguous case (#32383)
Summary:
Step 2 of https://github.com/pytorch/pytorch/issues/31975

Vectorized memory access is enabled. Generated code: https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise-vec.ipynb

```
void at::native::modern::elementwise_kernel<4, 64, 4, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3> >(int, at::native::add_kernel_cuda(at::TensorIterator&, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3>)
```

**ASM:**

```
	.section	.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,"ax",progbits
	.sectioninfo	@"SHI_REGISTERS=20"
	.align	128
        .global         _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_
        .type           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,function
        .size           _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,(.L_40898 - _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_)
        .other          _ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_,@"STO_CUDA_ENTRY STV_DEFAULT"
_ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
.text._ZN2at6native6modern18elementwise_kernelILi4ELi64ELi4EZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlffE_NS_6detail5ArrayIPcLi3EEEEEviT2_T3_:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R9, SR_CTAID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 177
        /*0030*/                   S2R R0, SR_TID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 294
        /*0040*/                   IMAD.SHL.U32 R9, R9, 0x100, RZ ;
        /*0050*/                   IADD3 R5, -R9, c[0x0][0x160], RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0060*/                   SHF.R.S32.HI R17, RZ, 0x1f, R9 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 296
        /*0070*/                   ISETP.GE.AND P0, PT, R5, 0x100, PT ;
        /*0080*/              @!P0 BRA `(.L_3173) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0090*/                   IMAD.SHL.U32 R12, R9.reuse, 0x4, RZ ;
        /*00a0*/                   SHF.L.U64.HI R17, R9, 0x2, R17 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 260
        /*00b0*/                   IADD3 R8, P0, R12.reuse, c[0x0][0x188], RZ ;
        /*00c0*/                   IADD3 R2, P1, R12, c[0x0][0x190], RZ ;
        /*00d0*/                   IADD3.X R9, R17.reuse, c[0x0][0x18c], RZ, P0, !PT ;
        /*00e0*/                   IADD3.X R3, R17, c[0x0][0x194], RZ, P1, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 218
        /*00f0*/                   IMAD.WIDE R8, R0, 0x10, R8 ;
        /*0100*/                   IMAD.WIDE R2, R0, 0x10, R2 ;
        /*0110*/                   LDG.E.128.SYS R8, [R8] ;
        /*0120*/                   LDG.E.128.SYS R4, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 256
        /*0130*/                   IADD3 R12, P0, R12, c[0x0][0x180], RZ ;
        /*0140*/                   IADD3.X R13, R17, c[0x0][0x184], RZ, P0, !PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*0150*/                   IMAD.WIDE R12, R0, 0x10, R12 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0160*/                   FFMA R7, R7, c[0x0][0x168], R11 ;
        /*0170*/                   FFMA R6, R6, c[0x0][0x168], R10 ;
        /*0180*/                   FFMA R5, R5, c[0x0][0x168], R9 ;
        /*0190*/                   FFMA R4, R4, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 238
        /*01a0*/                   STG.E.128.SYS [R12], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 301
        /*01b0*/                   EXIT ;
.L_3173:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*01c0*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
        /*01d0*/                   BMOV.32.CLEAR RZ, B0 ;
        /*01e0*/                   BSSY B0, `(.L_3174) ;
        /*01f0*/               P0 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0200*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0210*/                   LEA.HI.X.SX32 R4, R0, R17, 0x1, P1 ;
        /*0220*/                   LEA R2, P1, R3, c[0x0][0x188], 0x2 ;
        /*0230*/                   LEA.HI.X R3, R3, c[0x0][0x18c], R4, 0x2, P1 ;
        /*0240*/                   LDG.E.SYS R8, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0250*/                   IADD3 R4, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0260*/                   ISETP.GE.AND P1, PT, R4, R5, PT ;
        /*0270*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0280*/                   LDG.E.SYS R4, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0290*/                   IADD3 R6, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02a0*/                   ISETP.GE.AND P1, PT, R6, R5, PT ;
        /*02b0*/               P1 BRA `(.L_3175) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*02c0*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02d0*/                   LDG.E.SYS R7, [R2+0x200] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*02e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*02f0*/              @!P1 LDG.E.SYS R6, [R2+0x300] ;
.L_3175:
        /*0300*/                   BSYNC B0 ;
.L_3174:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0310*/                   BMOV.32.CLEAR RZ, B0 ;
        /*0320*/                   BSSY B0, `(.L_3176) ;
        /*0330*/               P0 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0340*/                   IADD3 R3, P1, R9, R0, RZ ;
        /*0350*/                   LEA.HI.X.SX32 R10, R0, R17, 0x1, P1 ;
        /*0360*/                   LEA R2, P1, R3, c[0x0][0x190], 0x2 ;
        /*0370*/                   LEA.HI.X R3, R3, c[0x0][0x194], R10, 0x2, P1 ;
        /*0380*/                   LDG.E.SYS R11, [R2] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0390*/                   IADD3 R10, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03a0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03b0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*03c0*/                   LDG.E.SYS R13, [R2+0x100] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*03d0*/                   IADD3 R10, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*03e0*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
        /*03f0*/               P1 BRA `(.L_3177) ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 184
        /*0400*/                   IADD3 R10, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 180
        /*0410*/                   ISETP.GE.AND P1, PT, R10, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 183
        /*0420*/                   LDG.E.SYS R10, [R2+0x200] ;
        /*0430*/              @!P1 LDG.E.SYS R15, [R2+0x300] ;
.L_3177:
        /*0440*/                   BSYNC B0 ;
.L_3176:
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0450*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0460*/                   IADD3 R9, P0, R9, R0, RZ ;
        /*0470*/                   FFMA R11, R11, c[0x0][0x168], R8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0480*/                   IADD3 R14, R0, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0490*/                   LEA.HI.X.SX32 R12, R0, R17, 0x1, P0 ;
        /*04a0*/                   LEA R2, P0, R9.reuse, c[0x0][0x180], 0x2 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04b0*/                   ISETP.GE.AND P1, PT, R14, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*04c0*/                   LEA.HI.X R3, R9, c[0x0][0x184], R12, 0x2, P0 ;
        /*04d0*/                   STG.E.SYS [R2], R11 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*04e0*/               P1 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*04f0*/                   IADD3 R8, R0, 0x80, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0500*/                   FFMA R13, R13, c[0x0][0x168], R4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0510*/                   ISETP.GE.AND P0, PT, R8, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0520*/                   STG.E.SYS [R2+0x100], R13 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0530*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 197
        /*0540*/                   IADD3 R0, R0, 0xc0, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 196
        /*0550*/                   FFMA R7, R10, c[0x0][0x168], R7 ;
        /*0560*/                   FFMA R15, R15, c[0x0][0x168], R6 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0570*/                   ISETP.GE.AND P0, PT, R0, R5, PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*0580*/                   STG.E.SYS [R2+0x200], R7 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 193
        /*0590*/               P0 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/MemoryAccess.cuh", line 196
        /*05a0*/                   STG.E.SYS [R2+0x300], R15 ;
        /*05b0*/                   EXIT ;
.L_3178:
        /*05c0*/                   BRA `(.L_3178);
        /*05d0*/                   NOP;
        /*05e0*/                   NOP;
        /*05f0*/                   NOP;
.L_40898:
```

We can clearly see the `LDG.E.128` in it, which is a result of vectorization.

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-vec.ipynb

Benchmark on P100, dtype `uint8`:

before:
```
1.4.0a0+a5b4d78
e1d97025eeeddcf083e9bee0c8f6a53168991a71
22.2 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.7 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52 µs ± 312 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
86.9 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
154 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
291 µs ± 668 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
566 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.18 ms ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.29 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.4 ms ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after:
```
1.4.0a0+a5b4d78
1281cdfd8188fe86241ecaf71d001809d016c3a3
24 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.1 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
67.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
116 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
215 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
413 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
824 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 478 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.19 ms ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Benchmark on P100, dtype `half`:

Before:
```
1.4.0a0+a5b4d78
1c017f0c14c91bd5125ab387a90441b0c0e2f3ad
30.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.1 µs ± 83 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 99.1 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
418 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
865 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.3 ms ± 527 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.77 ms ± 741 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After

```
1.4.0a0+a5b4d78
7e50ee27333e7047072d328d03767b4845286356
28.9 µs ± 61.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.2 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
63.8 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
199 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
380 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
743 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.91 ms ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.8 ms ± 296 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

cc: csarofeen ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32383

Differential Revision: D19697455

Pulled By: ngimel

fbshipit-source-id: 0707481c2f334e6634c000b4afd275b2fee8fbe1
2020-02-03 16:20:40 -08:00
4baadd54d7 add SpatialBN lowered fake fp16
Summary:
SpatialBNFakeLoweredFp16NNPI

This is the fake operator for SpatialBN that gets lowered into add/mul/div, etc.

Test Plan: test_spatialbn

Reviewed By: tracelogfb, amylittleyang

Differential Revision: D19658680

fbshipit-source-id: 2abddbcd9a2023ac75c494f20eaac2051b7139dc
2020-02-03 15:03:34 -08:00
5c019fede3 [ONNX] Fix for constant folding flaky tests (#32546)
Summary:
Fix for constant folding flaky tests
Looks like the constant folding test modules are sometimes exported with ONNX_ATEN op export type, which is causing the CI failures.
I'm unable to repro this issue locally, but my guess is that the op export param is being overwritten at some point in the CI build.
This PR sets the op export type and hopefully fixes the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32546

Reviewed By: hl475

Differential Revision: D19606919

Pulled By: houseroad

fbshipit-source-id: 31793d6857bbbf99b43b4a7c22a045a56ae19e44
2020-02-03 14:23:50 -08:00
a751ddaaa5 Use leaky singletons for torch.distributed. (#32923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32923

As per
https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 and
https://isocpp.org/wiki/faq/ctors#static-init-order-on-first-use-members, we
should be using leaky singletons to avoid static initialization order problem.

Closes https://github.com/pytorch/pytorch/issues/27412
ghstack-source-id: 97601384

Test Plan: waitforbuildbot

Differential Revision: D19688986

fbshipit-source-id: 8c1935fb7da8a7116dbca55eb43dc04bc02695ac
2020-02-03 14:15:18 -08:00
6996f8d880 Add missing default_collate in dataloader.pyi
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28935

Differential Revision: D19698781

Pulled By: ezyang

fbshipit-source-id: abdd735c98656ed16cd326529441d1fcec2ace3e
2020-02-03 14:01:49 -08:00
1c42b9466b [ONNX] Update support of exporting bool type index mask (#32445)
Summary:
e.g. `tensor[torch.tensor([0, 1, 0], dtype=torch.bool)]`
Previously the mask was of type uint8; both uint8 and bool should be supported for export.
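
A minimal export sketch of the pattern being fixed (the opset version here is an assumption):

```python
import io
import torch

class MaskIndex(torch.nn.Module):
    def forward(self, x):
        mask = torch.tensor([True, False, True], dtype=torch.bool)
        return x[mask]

# previously only a uint8 mask exported cleanly; a bool mask now works too
torch.onnx.export(MaskIndex(), torch.randn(3), io.BytesIO(), opset_version=11)
```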
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32445

Reviewed By: hl475

Differential Revision: D19610713

Pulled By: houseroad

fbshipit-source-id: 8df636e0c3cb0b82919a689242a962c79220209c
2020-02-03 13:01:14 -08:00
e03e4f3a2d [ONNX] Add einsum export (#32716)
Summary:
Adding symbolic for onnx einsum as part of opset 12
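
A minimal sketch of what the new symbolic enables (the contraction string and shapes are illustrative):

```python
import io
import torch

class BatchedMatmul(torch.nn.Module):
    def forward(self, a, b):
        # lowers to ONNX's Einsum op, which is available from opset 12
        return torch.einsum("bij,bjk->bik", a, b)

torch.onnx.export(BatchedMatmul(),
                  (torch.randn(2, 3, 4), torch.randn(2, 4, 5)),
                  io.BytesIO(), opset_version=12)
```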
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32716

Reviewed By: hl475

Differential Revision: D19626168

Pulled By: houseroad

fbshipit-source-id: d8cc8af5f05f36aca3cd55dead602261ccdfec51
2020-02-03 12:56:50 -08:00
167a892e99 Add missing shuffle attribute to DistributedSampler typing file
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28763

Differential Revision: D19698808

Pulled By: ezyang

fbshipit-source-id: 7820acd7b0715ebf1d9ae954dca0058b6759075e
2020-02-03 12:02:58 -08:00
48eff08256 Fix the level of headers in pytorch/CONTRIBUTING.md (#28412)
Summary:
**Running Clang-Tidy**, **Pre-commit Tidy/Linting Hook**, and **Building PyTorch with ASAN** shouldn't be nested under **Windows development tips**.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28412

Differential Revision: D19700228

Pulled By: ezyang

fbshipit-source-id: 39d999c68e4bd9264f4ae1fdab517871c883a663
2020-02-03 11:50:25 -08:00
14c15eb3b0 Py2 -> py3 for caffe2/caffe2/contrib/tensorboard (#32882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32882

Update tensorboard binary and unit tests to python 3

Test Plan:
```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_test
```
```
> buck test //caffe2/caffe2/contrib/tensorboard:tensorboard_exporter_test
```

Reviewed By: sanekmelnikov

Differential Revision: D19670873

fbshipit-source-id: f5eb65ccbb4ecfdc801b9fa05a60d4c5c29dc428
2020-02-03 11:36:35 -08:00
00c6b90327 Fix in documentation of convolutional modules (#30079)
Summary:
I noticed the description of the initialization of convolutional modules is inconsistent with the actual implementation. There are two such cases:

1) `k` in the initialization of ConvTranspose modules is not dependent on the input channels but on the output channels (`kaiming_uniform_` uses the size of the second dimension of `weight` which is transposed in the first two dimensions).

2) Both the normal convolutions and the transposed ones use `k` divided by `groups` (a quick check appears below).
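
A quick numeric check of the corrected claim (hypothetical sizes). The nn modules call `kaiming_uniform_(weight, a=math.sqrt(5))`, which yields a uniform bound of `sqrt(1/fan_in)`, and for ConvTranspose the `fan_in` is computed from the second weight dimension, i.e. `out_channels/groups * prod(kernel_size)`:

```python
import math
import torch.nn as nn

conv = nn.ConvTranspose2d(in_channels=8, out_channels=4, kernel_size=3, groups=2)
# the bound depends on out_channels because the weight is transposed
# in its first two dimensions
k = conv.groups / (conv.out_channels * 3 * 3)
assert conv.weight.abs().max().item() <= math.sqrt(k)
```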
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30079

Differential Revision: D19698511

Pulled By: ezyang

fbshipit-source-id: 1ba938fbbd97663eaf29fd1245872179d2761fff
2020-02-03 11:22:36 -08:00
37953d92d1 raise when jit-load.ing a folder (#27836)
Summary:
Very similar to https://github.com/pytorch/pytorch/issues/16267 but handling directories.

Stoked to contribute!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27836

Differential Revision: D19698398

Pulled By: ezyang

fbshipit-source-id: eabc3a44d258124f860babb47ab91e22c2c3d6cc
2020-02-03 11:19:57 -08:00
3fa907c145 [docs] Fix argument type of torch.masked_select (#30385)
Summary:
This should be `BoolTensor`
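
For reference, the corrected usage:

```python
import torch

x = torch.randn(4)
mask = x > 0                         # a BoolTensor, as the docs now state
print(torch.masked_select(x, mask))  # 1-D tensor of the selected elements
```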
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30385

Differential Revision: D19698414

Pulled By: ezyang

fbshipit-source-id: 68f1e10eb9d4b99552bb158f6ad7e6ff0f7cc1c4
2020-02-03 11:15:11 -08:00
10183061eb [ONNX] Update ONNX landing page since 1.3 (#32805)
Summary:
* New ops supported for exporting.
* Updates on support for tensor indexing and dynamic lists of tensors.
* lara-hdr, spandantiwari Should we also include updates on torchvision support in this page?

cc houseroad, neginraoof Please review if I have missed anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32805

Reviewed By: hl475

Differential Revision: D19635699

Pulled By: houseroad

fbshipit-source-id: b6be4fce641f852dcbceed20b4433f4037d8024a
2020-02-03 10:38:29 -08:00
ef50161ec9 [JIT] Update OVERVIEW.md
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28870

Differential Revision: D19698758

Pulled By: ezyang

fbshipit-source-id: 23167ec5bf9f7ab81012a124206bb4c2bdd6ca06
2020-02-03 10:32:36 -08:00
7cddc302e5 min, max: check that operand and outputs are on the same device type (#32862)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32001
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32862

Differential Revision: D19695935

Pulled By: ezyang

fbshipit-source-id: bb37eb7a187214aa69259828024366f479a258d7
2020-02-03 10:16:22 -08:00
b34e0dda24 Emit the C++ version when compiling pytorch from source. (#32819)
Summary:
This is needed because we sometimes change a build script, and with it the `std=c++XX` flag, which does not get caught until the compilation has progressed for a while.

https://github.com/pytorch/pytorch/issues/31757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32819

Differential Revision: D19697205

Pulled By: ezyang

fbshipit-source-id: b045a1d15e24c4c6007b5d1464756051d32bf911
2020-02-03 10:12:03 -08:00
c841ab403c add missing method annotations to torch.Tensor (#30576)
Summary:
Looks like some of the tensor methods defined in https://github.com/pytorch/pytorch/blob/master/torch/tensor.py#L393 were missing.

Also add missing self object to `map_`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30576

Differential Revision: D19698355

Pulled By: ezyang

fbshipit-source-id: 6df99f17d5de11715dbe89aecb292612405c08ac
2020-02-03 09:59:14 -08:00
e085c55e53 Fix \\ warnings/errors when building optim documentation (#32911)
Summary:
This PR fixes the warnings and errors attributed to the use of `\\` outside of a proper environment. While rendered correctly in the documentation, it produces the warning
```
LaTeX-incompatible input and strict mode is set to 'warn': In LaTeX, \\ or \newline does nothing in display mode [newLineInDisplayMode]
```
on the CI tools and errors with
```
ParseError: KaTeX parse error: Expected 'EOF', got '\\' at position (x): ...
```
when not set to warn.

This PR also makes minor formatting adjustments. The `CosineAnnealingLR` documentation has been adjusted to remove an unnecessarily large fraction and to improve spacing. The `SGD` documentation has been adjusted so that variables are consistently typeset and so that it follows the convention of punctuating equations. I attached images of the current documentation, the new documentation and a marked version to highlight differences.

* SGD:
New: ![new_sgd](https://user-images.githubusercontent.com/53704971/73596383-98795500-44d6-11ea-97ce-bac02a0a1638.png)
Current: ![current_sgd](https://user-images.githubusercontent.com/53704971/73596384-98795500-44d6-11ea-86d3-b407cebbb513.png)
Marked new: ![marked_sgd](https://user-images.githubusercontent.com/53704971/73596385-98795500-44d6-11ea-9e06-9ac5e5e27270.png)

* CosineAnnealingLR:
New: ![new_calr](https://user-images.githubusercontent.com/53704971/73596382-98795500-44d6-11ea-9c90-02406d297bae.png)
Current: ![current_calr](https://user-images.githubusercontent.com/53704971/73596387-9911eb80-44d6-11ea-93fb-ee72d695312a.png)
Marked new: ![marked_calr](https://user-images.githubusercontent.com/53704971/73596386-9911eb80-44d6-11ea-91a6-ed7a62b4e255.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32911

Differential Revision: D19697114

Pulled By: ezyang

fbshipit-source-id: 567304bd4adcfa4086eae497cb818cf74375fe5d
2020-02-03 09:54:38 -08:00
7101f6b5c0 Properly handle NaN in binary max and min (#32541)
Summary:
The output depends asymmetrically on whether the first or the second
argument is NaN. See https://github.com/pytorch/pytorch/issues/25016 for detail of the issue.

This is part of a continuing effort that was dropped in https://github.com/pytorch/pytorch/issues/30851

The failure in https://github.com/pytorch/pytorch/issues/27185 is resolved by explicitly casting a half type number to float when applying `isnan`.
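
A small sketch of the symmetry being fixed (behavior as described in this summary):

```python
import torch

a = torch.tensor([float("nan"), 1.0])
b = torch.tensor([1.0, float("nan")])
# after this change the result is NaN in both positions,
# regardless of which argument holds the NaN
print(torch.max(a, b))
```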

Close https://github.com/pytorch/pytorch/issues/25016
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32541

Differential Revision: D19644643

Pulled By: VitalyFedyunin

fbshipit-source-id: 8d49e6ed5a9996a817df7a9419dc5eee601430bc
2020-02-03 09:04:39 -08:00
e87887ccb4 Update type hints for torch.optim.optimizer.Optimizer (#32900)
Summary:
This PR fixes type hints for `torch.optim.optimizer.Optimizer` object, issue also reported in https://github.com/pytorch/pytorch/issues/23731

To test things I used the following optimiser implementation, which is fully covered with type hints:

```python
from typing import Optional, Callable, Union, Iterable

from torch import Tensor
from torch.optim.optimizer import Optimizer

OptClosure = Optional[Callable[[], float]]
_params_t = Union[Iterable[Tensor], Iterable[dict]]

class SGD(Optimizer):
    def __init__(self, params: _params_t, lr: float = 0.1) -> None:
        defaults = dict(lr=lr)
        super(SGD, self).__init__(params, defaults)

    def __setstate__(self, state: dict) -> None:
        super(SGD, self).__setstate__(state)

    def step(self, closure: OptClosure = None) -> Optional[float]:
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                p.data.add_(-group['lr'], d_p)
        return loss
```

Without the fix, `mypy` reports a bunch of inconsistencies in types and missing properties:

```bash
$ mypy  torch_optimizer/sgd.py
torch_optimizer/sgd.py:14: error: Too many arguments for "__init__" of "Optimizer"
torch_optimizer/sgd.py:17: error: "__setstate__" undefined in superclass
torch_optimizer/sgd.py:19: error: Return type "Optional[float]" of "step" incompatible with return type "None" in supertype "Optimizer"
torch_optimizer/sgd.py:24: error: "SGD" has no attribute "param_groups"
Found 4 errors in 1 file (checked 1 source file)
```

With the fix, no issues:
```bash
$ mypy  torch_optimizer/sgd.py
Success: no issues found in 1 source file
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32900

Differential Revision: D19697175

Pulled By: ezyang

fbshipit-source-id: d5e2b3c421f69da3df8c32b3d53b4b6d15d61a41
2020-02-03 09:00:01 -08:00
29e6f13cd1 Enable MKL on MacOS if installed (#32905)
Summary:
Fix the cmake script that missed the MKL directories

Signed-off-by: caozhong <zhong.z.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32905

Differential Revision: D19688496

Pulled By: ezyang

fbshipit-source-id: d04a608eea5f983e153a48b0b1eb0390aebbe6c0
2020-02-02 14:57:43 -08:00
f8dd65f2a1 Updating submodules
Summary:
GitHub commits:

e384ddc186

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 18d4371821439388a6b546a1953c31856c80ec85
2020-02-02 14:56:10 -08:00
ff0ba563d5 Updating submodules
Summary:
GitHub commits:

6eb4ee98ba

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 74dda0be26516756cd4d4d2df2167392fc48074a
2020-02-02 12:22:16 -08:00
71ad88199a Clarify the searched string is displayed in the error message
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32789

Differential Revision: D19646635

Pulled By: suo

fbshipit-source-id: 18233fee7c75f7da2a1826fb66f78a519e6d9c77
2020-02-01 17:24:37 -08:00
b564eaf7a8 Bug fixes: torch::tensor(floating-point values) -> default dtype, and torch::tensor(integer values) ->at::kLong (#32367)
Summary:
Some of the `torch::tensor` behavior is updated to better match the Python API (see the Python-side sketch after the list below). Fixes https://github.com/pytorch/pytorch/issues/32234.

This PR is BC-breaking in the following way:
- `torch::tensor({1.0f, 2.0f})`: float -> default dtype
- `torch::tensor(at::ArrayRef<int>({1, 2, 3}))`: int -> at::kLong
- `torch::tensor(std::vector<int>({1, 2, 3}))`: int -> at::kLong
- `torch::tensor(at::ArrayRef<float>({1.f, 2.f, 3.f}))`: float -> default dtype
- `torch::tensor(std::vector<float>({1.f, 2.f, 3.f}))`: float -> default dtype
- `torch::tensor(at::ArrayRef<double>({1., 2., 3.}))`: double -> default dtype
- `torch::tensor(std::vector<double>({1., 2., 3.}))`: double -> default dtype
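
For reference, the Python-side behavior these C++ changes are matching:

```python
import torch

print(torch.tensor([1, 2, 3]).dtype)    # torch.int64 (at::kLong)
print(torch.tensor([1.0, 2.0]).dtype)   # torch.get_default_dtype()
```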
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32367

Differential Revision: D19498484

Pulled By: yf225

fbshipit-source-id: 19c8dc2a56476266153cff4c404e7f84d309eb12
2020-02-01 15:00:07 -08:00
4cc6e6bbbe Adding scalar to the c10 registration type check
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32886

Test Plan: Imported from OSS

Differential Revision: D19673484

Pulled By: z-a-f

fbshipit-source-id: ea8478a4fe6788dcb044ec1ab7d51dc50ab3fa60
2020-02-01 13:15:50 -08:00
ce07fb26c0 Updating submodules
Summary:
GitHub commits:

3f4acb24bb
930ea23548
c0c5daf3db

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 878178c5412375d74e7f64d7e4142f57ddbc931f
2020-02-01 13:14:30 -08:00
c83f984906 Updating submodules
Summary:
GitHub commits:

5adba3596a
d8b4f2ff66
daa254211a
9c4684ff10
fdb82b21cb

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 4e74f7e888cc2004ba937d3bb253645fbd2388c5
2020-01-31 23:24:51 -08:00
040bc1d0e1 [JIT] make is_scripting a condvalue (#32871)
Summary:
Add `torch.jit.is_scripting` to the list of CondValues, i.e. values that, when used as the condition of an if statement, cause only one side of the if to be compiled. I'm not sure if we actually want this PR.

Pros:
- Makes it easier to add features that are not yet supported in TorchScript (like has_torch_function)
- The current idiom of writing `torch.jit.is_scripting` and factoring the block out into a function annotated with `torch.jit.ignore` is functionally equivalent but much more cumbersome (sketched below)

Cons:
- Makes it easier to add features that are not yet supported in TorchScript
- Perhaps confusing to a reader as to what is being compiled. We could potentially give it an all-caps name or otherwise change the name to make it stand out visually.
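
A sketch of the pattern this enables (assuming the feature lands as described):

```python
import torch

def fn(x):
    if torch.jit.is_scripting():
        # TorchScript compiles only this branch...
        return torch.relu(x)
    else:
        # ...so this eager-only branch no longer has to compile
        print("running eagerly on", x.shape)
        return torch.relu(x)

scripted = torch.jit.script(fn)
```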
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32871

Differential Revision: D19670383

Pulled By: eellison

fbshipit-source-id: 5257b0bd23c66f199d59a7f2c911e948301e5588
2020-01-31 18:23:42 -08:00
4d7ab255d3 [PyTorch][TorchScript] Add support for join on List of strings in TorchScript (#32847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32847

Add support for join on List of strings in TorchScript.

Test Plan:
(pytorch) smummadi@smummadi-mbp pytorch % python test/test_jit_string.py
Fail to import hypothesis in common_utils, tests are not derandomized
.
Ran 1 test in 1.090s
OK

Differential Revision: D19650809

fbshipit-source-id: 387a8f0e3cc3111fd3dadd3d54c90fc8c7774cf9
2020-01-31 18:20:38 -08:00
144eb59756 [rpc] don't crash callee when function does not exist on it, instead return Exception (#32726)
Summary:
Closes https://github.com/pytorch/pytorch/issues/27368.
Previously, if a function `func` did not exist on worker A but existed on B, and the user ran `rpc.rpc_sync(A, func)`, A would crash with a segmentation fault since it was not able to find the function. B would eventually time out, since RPCs time out after 60s by default.

At the root this comes from an unhandled exception when trying to deserialize the `PythonUDF` to run.

This PR makes it so that we can recover from this error, and A reports back a `RemoteException` to B indicating that the function was not found. Now, A will no longer crash and B can handle the exception appropriately and with more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32726

Differential Revision: D19648825

Pulled By: rohan-varma

fbshipit-source-id: 53847f4bfb68187db41c61d69ddac13613e814b4
2020-01-31 18:02:12 -08:00
a8d39a7937 Updating submodules
Summary:
GitHub commits:

e0fd90427f
c892e21dc6
3cdc99f2b2
800d24ddc5
74326cdb3c
e4af160c09
6c2fb05f6d
a0555ecf37
e4122f77fc

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 9e3e0a7231c3e5cc0167cd935541dd7a8a4ea84d
2020-01-31 17:56:39 -08:00
4493b10500 [PyTorch] Gate out mobile operator logging observer.
Summary: Introduce separate gating for mobile operator logging observer.

Reviewed By: ljk53

Differential Revision: D19665993

fbshipit-source-id: b81a228c55110a02edb8c2b6f9fd02e750b2ad69
2020-01-31 17:25:53 -08:00
10bd21d550 [JIT] fix nested select assign (#32877)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/31902

```
self.sub.a = 1
```
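
A fuller repro sketch of the snippet above (module names are hypothetical):

```python
import torch

class Sub(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = 0

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sub = Sub()

    def forward(self) -> int:
        self.sub.a = 1   # nested attribute assignment previously failed to compile
        return self.sub.a

torch.jit.script(M())
```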
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32877

Differential Revision: D19670322

Pulled By: eellison

fbshipit-source-id: 6d8f350b4d1169be1d2a56050fccd7c246ad9212
2020-01-31 16:58:26 -08:00
ad78c0f4fc Fixed the flaky test_rref_context_debug_info (#32749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32749

The test was flaky since the message from owner RRef confirming fork would arrive after the test checked whether the pending User RRefs map was empty - leading to an assertion error. This diff creates a utility function that should be used by any test to wait for this message to complete processing before doing any assertions related to the pending User RRefs map.

GitHub Issue: https://github.com/pytorch/pytorch/issues/30988

Test Plan: Stress tested `test_rref_context_debug_info` 200 times.

Differential Revision: D19612289

fbshipit-source-id: 57a7c19b1cf792b94c263d3efbbbb6da60c07d07
2020-01-31 16:53:18 -08:00
d03c9aaa05 Fix upsampling test case on ppc (#32786)
Summary:
Power and x86 give slightly different results when scaling images up using `torch.nn.functional.interpolate` and when using OpenCV's `resize`. This is causing `test_upsampling_not_recompute_scale_factor` to fail on Power but not on x86. This changes the expected value to what OpenCV on Power produces when the test case runs on Power.

See https://github.com/pytorch/pytorch/issues/31915

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32786

Differential Revision: D19672053

Pulled By: ezyang

fbshipit-source-id: 3497f852bdc6d782646773792f9107c857c7b806
2020-01-31 16:40:56 -08:00
fe01376ffe [JIT] namedtuple constants (#32873)
Summary:
If a namedtuple with immutable constant inputs was also the input/output of a function that expected a namedtuple, it would fail. Fixed by using the namedtuple constructor on serialization (no one has run into this bug yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32873

Differential Revision: D19668807

Pulled By: eellison

fbshipit-source-id: bae33506e53b6a979b4e65a3e7c989b1408c98f4
2020-01-31 15:25:31 -08:00
fbe121e395 Quantized sigmoid function
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31851

Test Plan: Imported from OSS

Differential Revision: D19280716

Pulled By: z-a-f

fbshipit-source-id: f47d37e32a675756fcaca293e2c14f90c43891de
2020-01-31 14:40:21 -08:00
7b65acdf9e Solves Issue #32750 - torch.prod now works fine with FP16 Input Tensor and FP32 Output Tensor (#32831)
Summary:
This PR solves Issue https://github.com/pytorch/pytorch/issues/32750.

- Changes the function `prod_kernel_impl` to use the `out_t` argument instead of `scalar_t`, which caused garbage output for an FP16 input with an FP32 output tensor.
- Adds test cases for `torch.prod` (for CUDA): tests both `torch.prod` and `torch.Tensor.prod`, checking all combinations of the dtypes `torch.float16` and `torch.float32` (usage sketch below).
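
The now-working combination, as a usage sketch (assuming a CUDA device is available):

```python
import torch

x = torch.rand(1024, device="cuda", dtype=torch.float16)
# FP16 input with FP32 accumulation/output, previously garbage
out = torch.prod(x, dim=0, dtype=torch.float32)
```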
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32831

Differential Revision: D19664666

Pulled By: ngimel

fbshipit-source-id: c275363355c832899f10325043535949cd12b2f8
2020-01-31 14:25:08 -08:00
8ddd5bb0e9 Don't serialize None values in observer (#32733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32733

Similar to https://github.com/pytorch/pytorch/pull/32318, we should stop serializing None values, since they can't be broadcast

Test Plan: Imported from OSS

Differential Revision: D19611586

Pulled By: jerryzh168

fbshipit-source-id: 369881de0567ed8eb25bdada892227f49bb5b29d
2020-01-31 13:28:43 -08:00
1760d5b83c Remove wrap_dim from codegen layer. (#32738)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32738

This is to simplify the codegen layer, with the goal of making it simple enough to just check in.

Test Plan: Imported from OSS

Differential Revision: D19610927

Pulled By: gchanan

fbshipit-source-id: 760734f579b1f655775e6d270918c361985f3743
2020-01-31 13:13:35 -08:00
660a93c558 Code cleaning: Some iterating variables in builtin_functions.cpp can be const (#32852)
Summary:
To suppress a clang-tidy warning:

    torch/csrc/jit/script/builtin_functions.cpp#L89

    [performance-for-range-copy] warning: loop variable is copied but only
    used as const reference; consider making it a const reference

Also make the const qualifier of scalar explicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32852

Differential Revision: D19663277

Pulled By: ezyang

fbshipit-source-id: f4ec5688d3cbea9a5f40db6063b7d111b0bf0cce
2020-01-31 12:55:20 -08:00
ada966b7d7 [pytorch] avoid thread_local std::vector<Call> for mobile build (#32849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32849

We learned that Android NDK's gcc + gnustl combination might produce a
use-after-free for thread_local variables with non-trivial destructors.

This PR removes such a thread_local use case from error_report.cpp for the mobile build;
it is the only such case included in the mobile lite-JIT build.
ghstack-source-id: 97491327

Test Plan: - CI

Reviewed By: dreiss

Differential Revision: D19652702

fbshipit-source-id: ee8d316ad5c6e6c8a8006eb25f3bba1618dd7e6d
2020-01-31 12:48:57 -08:00
d9e99ab544 Loops.cuh legacy code cleanup -- gpu_kernel_with_index (#32777)
Summary:
I didn't see any use case where the functor of `gpu_kernel_with_index` needs arguments other than the index. Merge conflict with https://github.com/pytorch/pytorch/pull/32755.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32777

Differential Revision: D19646381

Pulled By: ngimel

fbshipit-source-id: 81d2be74170457e39943274e3689845e83758bfa
2020-01-31 12:02:50 -08:00
fd3bd7777d Updating submodules
Summary:
GitHub commits:

01fc273e29
53222db222
dea724242e
3dd493b166
ec496347bc
03f4ec299e

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: e362b5df2099f1c3dd2ef7702d4bbd5bb85e4b27
2020-01-31 11:54:30 -08:00
b16dab8a41 Coding header is better specified in lowercase letters (#32850)
Summary:
The Python document <https://www.python.org/dev/peps/pep-0263/> gives
all examples using lowercase letters. Although it doesn't say so
explicitly, the following paragraph seems to indicate that uppercase
letters aren't legitimate:

> If a source file uses both the UTF-8 BOM mark signature and a magic encoding comment, the only allowed encoding for the comment is 'utf-8'.  Any other encoding will cause an error.

My Emacs also complains about the uppercase letters every time I save
the file.
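
For reference, the lowercase form:

```python
# -*- coding: utf-8 -*-
```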
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32850

Differential Revision: D19663281

Pulled By: ezyang

fbshipit-source-id: 48127d3c2fd6e22dd732a2766913735136ec2ebc
2020-01-31 10:02:30 -08:00
22466552e3 Updating submodules
Summary:
GitHub commits:

edc4a4f551
72c7112964
62c8286307

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 92dd070a28091dda81e315591d6d12cddfecf00f
2020-01-31 10:01:15 -08:00
ed10408cc6 Updating submodules
Summary:
GitHub commits:

a3394d248c
91f92d0106
e50c78af57
d49bb54c3d
504fda5cda
42086f8764
d5b454a9c0
0e31e0a8b0

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 7ce9d3444d653c6889ffe080425aa082c33f137a
2020-01-30 22:05:39 -08:00
03557a9838 Make save_for_lite_interpreter private (#32771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32771

It's a patch to #32621, making the API private.

Test Plan: Imported from OSS

Differential Revision: D19657307

Pulled By: iseeyuan

fbshipit-source-id: e604a0cbed6a1e61413daaafc65bea92b90f1f5d
2020-01-30 21:01:54 -08:00
c3b4bfcfed Add knobs to set the number of profiling runs and bailout depth (#32735)
Summary:
Diagnostic API to simplify debugging and experiments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32735

Differential Revision: D19626708

Pulled By: Krovatkin

fbshipit-source-id: aa8c0da94d4559329fd7c8093329aea4e0271b6a
2020-01-30 18:50:56 -08:00
12bcfa7c77 Remove Python dependency (toPyTuple/fromPyTuple, jitCompilationUnit, deserialize) in rref_impl.h/cpp (#32753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32753

Functions bound as ATen operators cannot have a Python dependency.

This refactors the code to remove that dependency.
ghstack-source-id: 97485800

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_functions_not_supported

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_functions_not_supported
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D5741675

fbshipit-source-id: 31ee60955be8d815d0773f3699e3ff2f1f9d8849
2020-01-30 17:52:48 -08:00
29fabb1fbc make tests for empty inputs check zero parameter grads (#32820)
Summary:
Makes batch norm with empty inputs return zero parameter gradients. Batch norm, group norm, and convolutions now return zero grads for their parameters, so the tests are updated to check that (sketch below). Fixes some bullet points in https://github.com/pytorch/pytorch/issues/12013 (interpolate is not fixed by this PR; it is being fixed in other PRs)
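
A minimal sketch of the behavior the tests now check (hypothetical sizes):

```python
import torch

bn = torch.nn.BatchNorm2d(3)
x = torch.empty(0, 3, 4, 4, requires_grad=True)
bn(x).sum().backward()
# parameter grads are zeros (not None), which DDP relies on
assert torch.all(bn.weight.grad == 0) and torch.all(bn.bias.grad == 0)
```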
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32820

Differential Revision: D19651470

Pulled By: ngimel

fbshipit-source-id: 96fdd085f9b0e98e91217dd2ac1f30f9c482b8be
2020-01-30 17:42:55 -08:00
bc2e05a398 Update Docs for building PyTorch for Android.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32578

Reviewed By: ljk53

Differential Revision: D19588904

Pulled By: dreiss

fbshipit-source-id: 2934752b9c5b94f2f141417669d8385be44d703b
2020-01-30 17:12:03 -08:00
fcf9fcedf4 Remove needs_dynamic_casting from TensorIterator and move it to Loops.cuh (#32755)
Summary:
Remove `needs_dynamic_casting` from TensorIterator and move it to `Loops.cuh`.

The original design of `needs_dynamic_casting` is fundamentally flawed: it injects logic into TensorIterator and uses a bunch of boolean values to test whether dynamic casting is needed. This makes it very fragile, as the TensorIterator is so complicated that it is easy to introduce unnecessary dynamic casts. It also makes `gpu_kernel` very inflexible: different cases need to manipulate the TensorIterator to make it work.

For example, currently
```python
torch.zeros(10, device='cuda').mul_(0.9)
```
needs dynamic cast, but it shouldn't.

Testing whether dynamic casting is needed can be easy: just compare the dtypes of the lambda with the dtypes of the operands. If they don't match, dynamically cast; otherwise don't.
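
In pseudo-Python, the proposed rule is just (hypothetical names):

```python
def needs_dynamic_casting(lambda_dtypes, operand_dtypes):
    # cast only where the functor's dtypes disagree with the operands' dtypes
    return any(l != o for l, o in zip(lambda_dtypes, operand_dtypes))
```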
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32755

Differential Revision: D19644092

Pulled By: ngimel

fbshipit-source-id: 130bb8bd78d20c2ed1bdfc9d9fb451eb0f0c7e55
2020-01-30 17:06:23 -08:00
0f0972051a Cudnn bn size fix (#32763)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/29744 by falling back to native batch norm implementation, if cudnn cannot execute the provided shape.

Shape numbers were verified for cudnn 7.6.5.32 with tensor shapes:
```python
# for spatial bn
x = torch.Size([880801, 256, 5])
x = torch.Size([65535, 256, 5])
x = torch.Size([880801, 64, 4, 4])
x = torch.Size([65535, 64, 4, 4])

# for per-act bn
x = torch.Size([131070, 2048])
x = torch.Size([262136, 2048])
```
for `training()` and `eval()` mode using `torch.float32` and `torch.float16`.

I've increased the shapes used in our current smoke test, but I can also add all use cases of the support matrix, if wanted.

CC ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32763

Differential Revision: D19644328

Pulled By: ngimel

fbshipit-source-id: c2151bf9fe6bac79b8cbc69cff517a4b0b3867aa
2020-01-30 16:57:15 -08:00
bcb7c22679 [PyTorch BC] Fix the ci (#32843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32843

fix the ci by skipping aten::join

Test Plan: ci

Reviewed By: hl475

Differential Revision: D19650584

fbshipit-source-id: 4446eef568ded334217ff9205a795daffebe41a1
2020-01-30 16:05:03 -08:00
5380e16db9 Updating submodules
Summary:
GitHub commits:

73638a8795
7a83deaa83
969d173d11

Test Plan: n/a

Reviewed By: wittgenst

fbshipit-source-id: 399ed7a972876727a6bfd1409667c735c406fef5
2020-01-30 15:41:49 -08:00
765904f1b9 [torch] fd error check
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32797

Differential Revision: D19642262

Pulled By: mrshenli

fbshipit-source-id: 1720812166dd583dca6d72cb7e24b65ec013a62b
2020-01-30 15:30:03 -08:00
94ddc2c462 Resubmit more code fakefp16 mapping unification (#32798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32798

ATT

Test Plan: unittests

Reviewed By: amylittleyang

Differential Revision: D19632251

fbshipit-source-id: 670004050d67415bb24392f3520afa32b64ce740
2020-01-30 12:48:48 -08:00
690d41f24e Centralize addition of "always on" dispatch keys. (#32734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32734

VariableTensorId is the only key with this treatment today,
but BackendSelect and CompoundOp are coming soon.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19628091

Pulled By: ezyang

fbshipit-source-id: 250753f90528fa282af7a18d8d2f7736382754bd
2020-01-30 11:49:40 -08:00
5ddd2cd92b Make DispatchKeyGuards accept DispatchKey::Undefined (#32729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32729

When working on the vmap prototype I noticed that this was helpful
as it lets me easily initialize a no-op guard, if I need to do it
at constructor time (which I usually do, because the guards don't
have move constructors).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19628092

Pulled By: ezyang

fbshipit-source-id: d6259a3f70d287cdac2e4a5f3984e2880f19bdc2
2020-01-30 11:49:35 -08:00
3d0a470d89 Rename DispatchKey::UndefinedTensorId to Undefined (#32728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32728

It doesn't have much to do with tensors anymore.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19628093

Pulled By: ezyang

fbshipit-source-id: 4d57111cdf44ba347bec8a32bb5b4b47a83c1eaf
2020-01-30 11:47:40 -08:00
a40a19ccab Remove GIL from RRefContext (#32807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32807

After this commit, RRefContext no longer depends on pybind.

Test Plan: Imported from OSS

Differential Revision: D19636316

Pulled By: mrshenli

fbshipit-source-id: 88faa101c32e9019e979ae8e5da6706e49842726
2020-01-30 10:53:25 -08:00
413c0f6c29 Fixes moving after weight norm application (#32563)
Summary:
This PR updates how RNNs handle their "flat weights." In particular, it allows for only some flat weights to be "materialized" when apply is called, and it updates the flattening behavior to only apply if all flat weights are (1) materialized, (2) share a dtype and (3) are acceptable to cuDNN.

One test is modified and another created to test these changes. One practical effect of this change is that weight norm can be successfully applied to a module BEFORE that module is moved to an accelerator. Previously doing so would throw an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32563

Differential Revision: D19602725

Pulled By: mruberry

fbshipit-source-id: d8f9441d17815c8c9ba15b256d4be36f784a3cf9
2020-01-30 10:31:11 -08:00
9bab617b3e Make python version a parameterizable option for Windows CI.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32823

Differential Revision: D19642347

Pulled By: ezyang

fbshipit-source-id: a4d461aa29a06bb7f5e5d359a2df2c90e9a4fd41
2020-01-30 08:16:43 -08:00
cc35c876cb Fix backcompat for linear_relu_dynamic_fp16 (#32803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32803

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#32803 Fix backcompat for linear_relu_dynamic_fp16**

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D19642281

Pulled By: albanD

fbshipit-source-id: 3b6ae4dd81bf8a70dd81ccbb02fffd7653bbd08c
2020-01-30 08:08:29 -08:00
fa65859270 Re-enable non-deterministic autograd tests
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32793

Test Plan: Imported from OSS

Differential Revision: D19634632

Pulled By: albanD

fbshipit-source-id: 9dda29536c2ed4afb81ecbea471ba615241bbac2
2020-01-30 08:00:19 -08:00
85bd3e5bdb Removing @expectedFailureXLA from test_nll_loss_empty_tensor_reduction_mean (#32701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32701

Because it's disabled in XLA(https://github.com/pytorch/xla/pull/1563)
Discussed in https://github.com/pytorch/xla/issues/1539

Test Plan: Imported from OSS

Differential Revision: D19633349

Pulled By: pbelevich

fbshipit-source-id: b9a81c976a96b325356ff210ff838dfcd5352db7
2020-01-30 07:38:12 -08:00
6874278985 Revert D19611800: [PyTorch][TorchScript] Add support for join on List of strings in TorchScript
Test Plan: revert-hammer

Differential Revision:
D19611800

Original commit changeset: cef66356abc1

fbshipit-source-id: 41af9e0de83b1fb808b17255ec905e137909457d
2020-01-30 06:46:28 -08:00
b0923acb29 Reduce RPC branches for Python/BuiltinOp/TorchScript (#32689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32689

As described in https://github.com/pytorch/pytorch/issues/32565
ghstack-source-id: 97440343

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_functions_not_supported

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_functions_not_supported
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D5721814

fbshipit-source-id: 9079e81764be1e7c7b85dd72a18c76f3ecfd2547
2020-01-30 01:19:35 -08:00
affd598c1f Fix/simplify alias annotation handling in op codegen. (#32574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32574

Previously, we ignored alias annotations when deriving argument mutability
and instead recognized particular signature patterns (in-place, out variant)
and assigned mutability accordingly. Op signatures that didn't fit these
patterns would error (e.g. see #30526, which this fixes).

No change in the generated binding code.

Code changes:
1. in function_wrapper.py, fix the mutability derivation logic used when creating an argument's c++ type property. Note that we temporarily need to trap a special case and apply the old logic, see code comment for details.

2. in gen_jit_dispatch.py, update logic that assumed only one mutable Tensor argument per declaration. Happily this mostly was accomplished by bypassing some now-redundant signature regeneration machinery. Another special case here requires that we keep the old machinery around temporarily.

Test Plan: Imported from OSS

Differential Revision: D19564875

Pulled By: bhosmer

fbshipit-source-id: 5637a9672923676d408c9586f3420bcc0028471a
2020-01-30 00:31:03 -08:00
fb159b5236 Some work on eager op binding codegen (gen_python_functions.py) (#29986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29986

Previously in addition to generating a python binding for each op,
we would generate an almost-trivial helper for each overload.
This PR eliminates the helpers, simplifying codegen logic a bit and
reducing the source-level indirection by a step.
Perf should be unchanged.

codegen diff: 1f2f07fb60

Note: in the interests of keeping the diff contained, there's only
some light cleanup here beyond what's necessary for the codegen changes.
Plan is to do some more substantial refactoring in followup PRs that
leave generated code unchanged.

Test Plan: Imported from OSS

Differential Revision: D18567980

Pulled By: bhosmer

fbshipit-source-id: eb9a81babb4489abd470842757af45580d4c9906
2020-01-30 00:29:53 -08:00
821b6aa769 [pytorch] Minor: avoid acquiring GIL twice in PyRRef::localValue() (#32785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32785

Add PythonRpcHandler::handleExceptionWithGIL() so that in PyRRef::localValue(),
we don't need to release the GIL and re-acquire the following line.
ghstack-source-id: 97418465

Test Plan: existing test coverage

Differential Revision: D19626195

fbshipit-source-id: db694d04b078811f819626789e1e86f1b35adb5b
2020-01-29 21:27:43 -08:00
c2d736cefb Add support for Dynamic LSTM quantization on Mobile (#32757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32757

This PR updates the main quantize_dynamic API to use QNNPACK backend for mobile
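
A minimal usage sketch (the explicit engine selection is an assumption; on mobile builds QNNPACK is chosen as described above):

```python
import torch

torch.backends.quantized.engine = "qnnpack"
model = torch.nn.LSTM(16, 32)
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.LSTM}, dtype=torch.qint8)
```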

Test Plan:
python test/test_quantization.py PostTrainingDynamicQuantTest.test_quantized_rnn

Imported from OSS

Differential Revision: D19632220

fbshipit-source-id: b4c51485c281d088524101b97c84dd806438b597
2020-01-29 20:55:48 -08:00
55c382e62b Fixed access to element in size tensor for scripting (#32652)
Summary:
When using scripting, there was an error when attempting to access a
specific element of the size tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32652

Reviewed By: hl475

Differential Revision: D19610726

Pulled By: houseroad

fbshipit-source-id: bca49927bbe71dbe7e7d7edf301908fe79e089b5
2020-01-29 18:33:46 -08:00
8ead65a946 [PyTorch][TorchScript] Add support for join on List of strings in TorchScript
Summary: Add support for join on List of strings in TorchScript.

Test Plan:
(pytorch) smummadi@smummadi-mbp pytorch % python test/test_jit_string.py
Fail to import hypothesis in common_utils, tests are not derandomized
.
----------------------------------------------------------------------
Ran 1 test in 1.090s

OK

Differential Revision: D19611800

fbshipit-source-id: cef66356abc14dfd100a806d25dd1a8bc9af0a11
2020-01-29 18:22:52 -08:00
cccf5e7011 Resolve rendezvous race condition
Summary:
When running ctr_mbl_feed, we've encountered a hang issue related to the zeus-based rendezvous handshake. It was mitigated by this diff: https://our.intern.facebook.com/intern/diff/D19167151/.

This diff resolves the race condition by adding a reference to the rendezvous handler.

Test Plan: x7340282797

Reviewed By: yifuwang

Differential Revision: D19627293

fbshipit-source-id: 560af289db8ef6cf8d6f101f95ec27d5a361fd04
2020-01-29 17:49:07 -08:00
3552be1090 [jit] fix the NoneType param/buffer hack (#32745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32745

Some parameters (like `bias` in conv) are optional. To achieve this
previously, you had to add `bias` as a constant, which would invoke some
pretty weird behavior in the frontend, summarized as:
```
if bias is not None:
  add it as a parameter normally
else: # bias is None
  add it as a constant with the value None
```

There are several things bad about this:
1. Bias is not a constant. Marking it `__constants__` is confusing.
2. It basically relies on an implementation detail (the frontend
processes parameters before constants) to work.

Okay, whatever. I don't even know why we did this originally, but
getting rid of it doesn't break anything, so I assume improved NoneType
refinement has made this a non-issue.

Note on perf: this will make no difference; if bias was `None` it's still
folded out today, and if bias is a Tensor it is added as a parameter
both before and after this change.

Test Plan: Imported from OSS

Differential Revision: D19628634

Pulled By: suo

fbshipit-source-id: d9128a09c5d096b938fcf567b8c23b09ac9ab37f
2020-01-29 17:04:39 -08:00
2e359ef86d enable empty batch for all flavor of convolutions (#32709)
Summary:
Resubmitting https://github.com/pytorch/pytorch/issues/32612 after a merge gone wrong. Enables convolution with an empty batch or an empty number of channels for all flavors of convolution (grouped convolution, convTranspose), which would make https://github.com/pytorch/pytorch/issues/31658 unnecessary. Also returns zero gradients for the parameters, which is necessary for correct DDP operation (sketch below).
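
A minimal sketch of the enabled case (hypothetical sizes):

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
x = torch.empty(0, 3, 8, 8, requires_grad=True)
out = conv(x)                 # shape (0, 8, 6, 6) instead of an error
out.sum().backward()
assert torch.all(conv.weight.grad == 0)   # zero (not None) grads, as DDP needs
```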
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32709

Differential Revision: D19627968

Pulled By: ngimel

fbshipit-source-id: 7359759bd05ff0df0eb658cac55651c607f1b59f
2020-01-29 16:33:48 -08:00
a840afbeb4 [pytorch][embeddingbag_8bit] Add include_last_offset option to Fused 8bit EmbeddingBag and parallelize the op (#32683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32683

Pull Request resolved: https://github.com/pytorch/glow/pull/4079

Similar to D17768404, we changed the 8-bit fused version of the EmbeddingBag operator to add the option to include the last offset, and parallelized the op.
ghstack-source-id: 97404645

Test Plan:
To generate the AVX2 code (`embedding_lookup_fused_8bit_rowwise_idx_avx2.cc`):
```
python hp_emblookup_codegen.py --fused --use-offsets
```

To test the correctness:

```
buck test //caffe2/torch/fb/sparsenn:test -- test_embedding_bag_byte_rowwise_offsets  --print-passing-details
```

Reviewed By: yinghai

Differential Revision: D19592761

fbshipit-source-id: f009d675ea3f2228f62e9f86b7ccb94700a0dfe0
2020-01-29 16:04:56 -08:00
b565d9b356 Logspace fixes (#32744)
Summary:
Reopening of PR https://github.com/pytorch/pytorch/issues/32631 with `viable/strict` base for testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32744

Differential Revision: D19626090

Pulled By: ngimel

fbshipit-source-id: ed0fc759198ee2edc23afdcb1e190a11d70ec4c8
2020-01-29 15:17:00 -08:00
fc2ff7912f [quantization] Remove incorrect fp16 dynamic linear/relu op
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32774

Test Plan: Imported from OSS

Differential Revision: D19624471

Pulled By: jamesr66a

fbshipit-source-id: eb6cb11fabf2ddd5edf345aff35b86b83c3af94c
2020-01-29 14:50:24 -08:00
9357b91180 Remove -Werror from test/cpp_extensions/setup.py (#32704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32704

-Werror is too aggressive a check for the test cpp extensions because it fails even on deprecation warnings that are included from the core codebase.

Fixes #32136

Test Plan: Imported from OSS

Differential Revision: D19620190

Pulled By: pbelevich

fbshipit-source-id: 0e91566eb5de853559bb59e68a02b0bb15e7341b
2020-01-29 14:12:32 -08:00
8b187e8f2a Fix ivalue_inl.h:353:29: warning: comparison of unsigned expression >= 0 is always true (#32778)
Summary:
`slot` is an unsigned integer, which is always `>= 0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32778

Differential Revision: D19625789

Pulled By: ngimel

fbshipit-source-id: c92c35c65d4372be934283e87aeba99e9e0ef353
2020-01-29 14:04:05 -08:00
c47c78d0bf Revert D19597036: More code fakefp16 mapping unification
Test Plan: revert-hammer

Differential Revision:
D19597036

Original commit changeset: deed61945884

fbshipit-source-id: c057e57810a99464aefb00b645613ecd6a7c5533
2020-01-29 13:32:42 -08:00
3ee6673e99 Refreshing numel on a stride update is pointless. (#32116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32116

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19579875

Pulled By: ezyang

fbshipit-source-id: 00393c9dc101967c79231bfae36b23b7b80135fb
2020-01-29 13:26:28 -08:00
8c6f52ac24 Delete resize_dim() (#32114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32114

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19579876

Pulled By: ezyang

fbshipit-source-id: d09a231ba891403a06eae0c2203e0ad7dd6d3a12
2020-01-29 13:26:23 -08:00
b371eab8c7 Expunge last two sites of resize_dim (#32112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32112

It turns out we already removed these from the CPU version; copy
the changes over.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19579874

Pulled By: ezyang

fbshipit-source-id: e40efbf94e128fd81421b227b76dd9c9c0256d96
2020-01-29 13:25:22 -08:00
c7df28a2a3 Delete copy/move constructors on these RAII guards. (#32727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32727

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19621858

Pulled By: ezyang

fbshipit-source-id: 5112c849252478d8249de4f8c8c5a2d6caf60672
2020-01-29 13:20:15 -08:00
5ffa1efa52 Add missing C10_API to dispatch key TLS setter/getters (#32557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32557

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19579853

Pulled By: ezyang

fbshipit-source-id: 45f83a7a5ead0344e4c13526abb5fafdedaed4a4
2020-01-29 13:20:09 -08:00
3b47922855 Improve documentation in dispatcher; remove unnecessary optional (#32533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32533

Applies renames based on comments in #32439.  I also updated some
other documentation and variable names while I was at it.

Fixes #32435.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19579854

Pulled By: ezyang

fbshipit-source-id: 85021a92a2a84501f49ee5c16318f81f5df64f8d
2020-01-29 13:18:29 -08:00
8cb05e72c6 Port BCELoss to ATen to increase accuracy (#31365)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/24933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31365

Differential Revision: D19557712

Pulled By: ezyang

fbshipit-source-id: 3ae78c949b2f6c21b294d986d28e09daa9b0c526
2020-01-29 12:58:37 -08:00
50d82f5122 Make VC++ version a parametrizable option for Windows CI. (#32043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32043

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19621910

Pulled By: ezyang

fbshipit-source-id: dce00a56ff679548fd9f467661c3c54c71a3dd4e
2020-01-29 12:11:47 -08:00
e84f9d9d0c Fix TensorProtosDBInput AttributeError (#32274)
Summary:
https://github.com/pytorch/pytorch/issues/6794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32274

Differential Revision: D19621889

Pulled By: ezyang

fbshipit-source-id: 1bdd042b6421a2798c7f1e9030dfc6dfc1246989
2020-01-29 12:05:43 -08:00
8693164acb Randomize xla port (#32718)
Summary:
fixes https://github.com/pytorch/pytorch/issues/30717
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32718

Differential Revision: D19607998

Pulled By: ailzhang

fbshipit-source-id: 81ba9c7c71988a64cdc8fa5500967509657438fe
2020-01-29 12:04:01 -08:00
b5d8982ae2 clean up GIL usage (#32748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32748

This follows up on PR #30630: we need to hold the GIL when calling jit::toPyObject(), and some bound functions need to be tagged with a GIL release if the underlying C++ code acquires the GIL itself. So:
1. pyRef::to_here() and pyRef::local_value() now acquire the GIL
2. pyRef::pickle() and pyRef::unpickle() are tagged to release the GIL
3. request_callback_impl also acquires the GIL as needed
4. the type parser uses the cached jitCompilationUnit_, which is also cleaned up in the cleanUp() function
ghstack-source-id: 97373011

Test Plan: unit test

Differential Revision: D19612337

fbshipit-source-id: 4d09f9b52ba626545ae7d31fea6b671301ed3890
2020-01-29 11:58:46 -08:00
eab99ab08e [android] fbjni DoNotStrip annotation for oss native methods (#32567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32567

As a first change to support ProGuard: even if these methods might not be called from Java, we register them at the JNI level, and this registration will fail if the methods are stripped.

Adding DoNotStrip to all native methods that are registered in OSS.

After the integration of consumerProguardFiles in fbjni, which prevents ProGuard from stripping DoNotStrip-annotated methods, this will fix errors with ProGuard enabled.

Test Plan: Imported from OSS

Differential Revision: D19624684

Pulled By: IvanKobzarev

fbshipit-source-id: cd7d9153e9f8faf31c99583cede4adbf06bab507
2020-01-29 11:52:53 -08:00
2471ddc96c Improved speed of Frobenius norm for non-complex dtype (#30871)
Summary:
In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for CUDA complex numbers is here: [pytorch-cuda-strided-complex extension](https://gitlab.com/pytorch-complex/pytorch-cuda-strided-complex)

Changes:
[x] Fixed the performance issue raised in https://github.com/pytorch/pytorch/issues/30704 so that non-complex numbers do not call `conj()` and `real()`.
[x] Fixed tensor_to_numpy() conversion likely broken by a `checkBackend()` in https://github.com/pytorch/pytorch/issues/27064.
[x] Fixed some ReduceOps and TensorCompare Ops that recently added a `checkBackend()`.
    - `checkBackend()` is replaced with a device type check and a layout check.
    - This ensures the ComplexCPU Type ID is supported.
[x] Added AVX support for complex `exp()`, as requested in https://github.com/pytorch/pytorch/issues/755
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30871

Differential Revision: D19200726

Pulled By: ezyang

fbshipit-source-id: d7e1be0b0a89c5d6e5f4a68ce5fcd2adc5b88277
2020-01-29 11:43:53 -08:00
b1c85dd916 Custom RNG DispatchKey (#32325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32325

The purpose of this PR is to enable PyTorch dispatching on `at::Generator*` parameters and demonstrate how it can be used in cpp extensions to implement custom RNG.
1. `CustomRNGKeyId` value added to DispatchKey enum and `DispatchKeySet key_set_` added to `at::Generator`
2. The overloaded `operator()(at::Generator* gen)` added to MultiDispatchKeySet.
3. The existing CPUGenerator and CUDAGenerator classes are supplied with CPUTensorId and CUDATensorId dispatch keys
4. The implementation of CPU's `cauchy_kernel`(as an example, because it's already moved to ATen) was templatized and moved to `ATen/native/cpu/DistributionTemplates.h` to make it available for cpp extensions
5. Minor CMake changes to make native/cpu tensors available for cpp extensions
6. RegisterCustomRNG test that demonstrates how CustomCPUGenerator class can be implemented and how custom_rng_cauchy_ native function can be registered to handle Tensor::cauchy_ calls.

Test Plan: Imported from OSS

Differential Revision: D19604558

Pulled By: pbelevich

fbshipit-source-id: 2619f14076cee5742094a0be832d8530bba72728
2020-01-29 11:30:04 -08:00
642c9ef922 More code fakefp16 mapping unification
Summary: ATT

Reviewed By: amylittleyang

Differential Revision: D19597036

fbshipit-source-id: deed61945884fb4b01d058f3c72c75f5a937a41c
2020-01-29 11:01:24 -08:00
d119de8abd Deduplication of type casting codes (#32730)
Summary:
This code is implemented twice in different places by different people; we should merge the implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32730

Differential Revision: D19622023

Pulled By: ezyang

fbshipit-source-id: a9cbda31428b335bf28a7e4050f51f58e787b94f
2020-01-29 10:13:15 -08:00
cbb744f00f apply linter to rpc test files (#32659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32659

Applies linter to RPC test files so that we can use linter shortcuts
without getting unnecessary changes to the whole file.
ghstack-source-id: 97361237

Test Plan: No actual changes.

Differential Revision: D19584742

fbshipit-source-id: a11ce74ee0e2817e6f774fff7c39bcab06e99307
2020-01-29 09:49:45 -08:00
8bc889e502 Fix crash of SobolEngine if default tensor type is cuda (#32496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32496

Addresses https://github.com/pytorch/pytorch/issues/32494

Test Plan:
```
import torch
from torch.quasirandom import SobolEngine

torch.set_default_tensor_type(torch.cuda.FloatTensor)
se = SobolEngine(3)
```

Reviewed By: 2timesjay

Differential Revision: D19517571

fbshipit-source-id: 02eb499ffbd4260474d348e9bb536fb8c36c2c31
2020-01-29 08:49:18 -08:00
c7bf4d22fe added exception args to the returned error message (#32693)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/32692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32693

Differential Revision: D19606757

Pulled By: mrshenli

fbshipit-source-id: 79fc09f8bb6a33e1b73ce0bbc45387544c7adc1b
2020-01-29 08:26:27 -08:00
c35ca84eee Get rid of some unused THGenerate*Type defines. (#32657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32657

The goal here is to simplify the codegen enough that we can just handwrite the bindings, so anything in here is "bad".

Test Plan: Imported from OSS

Differential Revision: D19584521

Pulled By: gchanan

fbshipit-source-id: 93005b178228c52a1517e911adde2e2fe46d66a5
2020-01-29 08:12:45 -08:00
594cadeb8f Make sure temporary vectors are properly initialized in avx2 code (#32722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32722

Checked using [this](https://godbolt.org/z/uAaE9R) that it gives the correct assembly.

Test Plan: Imported from OSS

Differential Revision: D19610012

Pulled By: albanD

fbshipit-source-id: 4d1cb812951ae03d412a0fba3c80730f0d286e1f
2020-01-29 07:58:25 -08:00
5e2311033e fix windows build (#32762)
Summary:
Remove the Windows visibility macro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32762

Differential Revision: D19616367

Pulled By: eellison

fbshipit-source-id: d824162fe92bff4cb2b1a170312cd14b6d7bd99d
2020-01-28 22:55:48 -08:00
fd850685da Updating submodules
Summary:
GitHub commits:

b81d0657df

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 82d39025e331083e58c0d0cc9b47985e590bb289
2020-01-28 21:03:34 -08:00
62d652f922 replaces .at with [] in getSlot (#32677)
Summary:
per title. cc qizzzh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32677

Differential Revision: D19596094

Pulled By: ngimel

fbshipit-source-id: 06177b9e12d203d84b541205437ef2ad51db0fac
2020-01-28 20:49:03 -08:00
c729614997 [JIT] Improve May Contain Alias Using Contained Elements (#32326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32326

Now that we have type-level granularity we can improve `mayContainAlias` queries. Each new value is initialized as containing the wildcard set of each contained mutable type. Whenever a value is added to a container, it is set to the wildcard set. Now, to check whether any two values contain overlapping values, we can just check whether the `containedMemoryLocations` of the two sets overlap.

Test Plan: Imported from OSS

Differential Revision: D19563262

Pulled By: eellison

fbshipit-source-id: c6d7489749c14b2054a6d50ef75baca699ada471
2020-01-28 18:08:56 -08:00
25d33a2ee8 [JIT] Use Type Level Granularity in Alias Analysis Wildcards (#32251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32251

Previously wildcard sets were associated by TypeKind, meaning all Lists were in one alias set, all Classes were in one alias set, etc. We can improve analysis by bucketing wildcard sets by TypePtr instead. Any two mutable types which can unify should be in the same wildcard set bucket.

This also allows us to do much simpler `mayContainAlias` analysis, and it improves `analyzeConservative` analysis because now we can recurse through all contained memory locations and mark writes, instead of recursing only one level deep into contained elements.

Test Plan: Imported from OSS

Differential Revision: D19563263

Pulled By: eellison

fbshipit-source-id: 371a37d1a8596abc6c53f41c09840b6c140ea362
2020-01-28 18:07:48 -08:00
02f055ffd9 Add mapping for FbFCPacked in fakefp16 transform
Summary: ATT. Since the infra is there.

Test Plan: run it

Reviewed By: amylittleyang

Differential Revision: D19605250

fbshipit-source-id: c68be4d7963afa4fa5f8f60c90f1913605eae516
2020-01-28 17:00:24 -08:00
18aab32959 Move exponential_ from TH to Aten (CPU) (#32501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32501

This diff will address https://github.com/pytorch/pytorch/issues/24699

We require the input `lambda` to be >= 0, matching https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.exponential.html#numpy-random-exponential. This check did not exist in the previous implementation.
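
A minimal usage sketch (not from the diff; the exact error type for a bad rate is an assumption):

```python
import torch

x = torch.empty(3)
x.exponential_(lambd=0.5)       # fills with Exp(rate=0.5) samples
print(x)

try:
    x.exponential_(lambd=-1.0)  # negative rates are now rejected up front
except RuntimeError as e:
    print("rejected:", e)
```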

Benchmark: I am using the PT operator microbenchmark.
```
================================================================================
Before the change, Program Output:
================================================================================
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: exponential_
# Mode: Eager
# Name: exponential__M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 21311.746

================================================================================
After the change, Program Output:
================================================================================
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: exponential_
# Mode: Eager
# Name: exponential__M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 20919.914

================================================================================
```

Test Plan: Sandcastle and Github tests

Reviewed By: BIT-silence

Differential Revision: D19518700

fbshipit-source-id: 0e79cb6a999c1278eb08b0d94cf61b119c85a36c
2020-01-28 16:59:22 -08:00
1f78bd0774 [caffe2] Early error throwing for corrupted embeddings
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32717

Reviewed By: xianjiec

Differential Revision: D19604954

fbshipit-source-id: c02eccf048c0dba3f66d729ab1fda50f3cacef63
2020-01-28 16:55:29 -08:00
6f7d5bb3e1 Temporarily disable the test_quantized_rnn test (#32742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32742

As Title says (Check https://github.com/pytorch/pytorch/issues/32644).
ghstack-source-id: 97352793

Test Plan: CI

Differential Revision: D19611029

fbshipit-source-id: 9f4a155c909f419e41c1d7078eb2796dd17cedd2
2020-01-28 16:50:59 -08:00
43d31ae4c3 Added ONNX model checker to ONNX export (#32298)
Summary:
Included the ONNX model checker code in the ONNX export; this forces the ONNX checker to run for all exported models, which should help with validating them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32298
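
For illustration (not part of the PR), a minimal export on which the checker now runs automatically; `"doubler.onnx"` is a placeholder path:

```python
import torch

class Doubler(torch.nn.Module):
    def forward(self, x):
        return x * 2

# The model checker now runs as part of export itself, so an invalid
# model fails here rather than at load time.
torch.onnx.export(Doubler(), torch.randn(1, 3), "doubler.onnx")
```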

Reviewed By: hl475

Differential Revision: D19538251

Pulled By: houseroad

fbshipit-source-id: eb20b124fe59200048f862ddaf20f6c59a0174d5
2020-01-28 16:28:54 -08:00
99228086a6 Added missing period in README.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32723

Differential Revision: D19607256

Pulled By: mlacayo

fbshipit-source-id: 2993014d4d90fa26acd5bc01ed7494cc43a29a62
2020-01-28 16:25:04 -08:00
e74e1ccc47 Use direct vector indexing in Object::getSlot() instead of at(). (#31627)
Summary:
This method is pretty hot.  In an internal workload, this single
call to at() accounted for ~2% of overall cycles.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31627

Reviewed By: yinghai

Differential Revision: D19607779

Pulled By: qizzzh

fbshipit-source-id: 1684919049a35fdad686d8396c7dce7243ab92d4
2020-01-28 16:17:16 -08:00
ee60cd9124 Back out "fix view listing in autograd codegen" (#32720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32720

Original commit changeset: 5ebc4c978af5

Test Plan: existing tests

Reviewed By: chenyangyu1988

Differential Revision: D19603336

fbshipit-source-id: 56051a716c4eedf49cfe7367ff447b4b9c5429ea
2020-01-28 16:10:47 -08:00
2060e0a9dd Split serialization tests to their own file (#32241)
Summary:
Stacked PRs
 * #32244 - Make zip serialization the default
 * **#32241 - Split serialization tests to their own file**

This makes them all easier to run as a batch. This PR is just a code move / fixing up imports. There are still some serialization tests in `test_torch.py` as part of `TestDeviceType`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32241

Pulled By: driazati

Differential Revision: D19415826

fbshipit-source-id: a3f6cfe1626ff2f9b9631c409bf525bd32e4639b
2020-01-28 15:04:05 -08:00
0327e75e14 Back out "[caffe2] use JIT'ed fp32 SLS" (#32711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32711

Original commit changeset: 4f29d34523ef

Test Plan: CI

Differential Revision: D19603967

fbshipit-source-id: af3f647fff416a84290a42217747948bac4d73c6
2020-01-28 14:07:11 -08:00
ffdcbadeaa Minor refactoring to improve code reuse (#32675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32675

It's good to have one location to do the mapping.

Test Plan: Everything still runs.

Reviewed By: amylittleyang

Differential Revision: D19590354

fbshipit-source-id: d8c0d14e4bdf27da3e13bd4d161cd135d6e3822b
2020-01-28 13:31:48 -08:00
9de3208449 [rpc][flaky-tests] fix for test_handle_send_exceptions and (#32656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32656

Fixes these flaky tests.

Test Plan: Run the test 500 times and verify that it succeeds every time.

Differential Revision: D19584453

fbshipit-source-id: 07cbc4914211f274182ac0fa74bb5ef6d43392d1
2020-01-28 12:40:12 -08:00
6e7e595c1d [rpc][easy] remove redundant test in rpc_test.py (#32588)
Summary:
Both `test_wait_all_workers` and `test_wait_all_workers_and_shutdown` test the same pattern: initialize RPC, call `_wait_all_workers`, and call `rpc.shutdown(graceful=False)`.

`test_wait_all_workers` seems to be more thorough since it tests one worker driving and the others waiting on it as well.

We shouldn't have duplicate tests, so this removes `test_wait_all_workers_and_shutdown`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32588

Differential Revision: D19566294

Pulled By: rohan-varma

fbshipit-source-id: b69519d169b3964649d47ad75532bda5de538241
2020-01-28 11:55:17 -08:00
0ea65d63cf [JIT] Fix stateful lambda stuff and simplify code in custom C++ binding API
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32658

Test Plan: Imported from OSS

Differential Revision: D19584701

Pulled By: jamesr66a

fbshipit-source-id: d556c7db2f32900eb1122348402789b59516a7d7
2020-01-28 11:03:04 -08:00
465ebd58ba [JIT] pickle serialization for custom bound classes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32604

Test Plan: Imported from OSS

Differential Revision: D19566633

fbshipit-source-id: 9387d3ff45cbd6ccde49ce190a52859481cc301c
2020-01-28 11:02:59 -08:00
34ccfba403 [JIT] Include custom_class.h in torch/script.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32586

Test Plan: Imported from OSS

Differential Revision: D19558716

fbshipit-source-id: be540d8ed7de0834e64be89ae621ae50befc83b0
2020-01-28 11:02:54 -08:00
06c19263d3 [JIT] Serialize attributes and types in ClassType serialization
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32555

Test Plan: Imported from OSS

Differential Revision: D19544737

Pulled By: jamesr66a

fbshipit-source-id: 2256cfba414a850cdc986bb5872dd4cb177b456c
2020-01-28 11:02:49 -08:00
1719da13f9 [JIT] Support for registering C++ lambdas as methods on custom C++ class
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32553

Test Plan: Imported from OSS

Differential Revision: D19543269

Pulled By: jamesr66a

fbshipit-source-id: 7e566650295e9d1c4f2f716470e061308a6210a0
2020-01-28 11:01:07 -08:00
da390914bd .circleci: Add workflows for Python 3.8 (#31948)
Summary:
Done by just editing `.circleci/cimodel/data/dimensions.py` to include `3.8` and then regenerating using `.circleci/regenerate.sh`

cc kostmo, mingbowan, ezyang, soumith

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31948

Differential Revision: D19602069

Pulled By: seemethere

fbshipit-source-id: ac57fde9d0c491c7d948a3f5944c3cb324d403c0
2020-01-28 10:26:03 -08:00
0dc38be407 consider FAIL_GUARD while counting indices for GUARDs (#32672)
Summary:
This handles a corner case where a user schedules a second bailout after the first one and the first one doesn't fire.
Alternatively, we could go back to the implementation that uses a hash set to remember the indices of bailouts that need to fire.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32672

Differential Revision: D19596872

Pulled By: Krovatkin

fbshipit-source-id: 41dcc374cd2501ac20a9892fb31a9c56d6640258
2020-01-28 08:59:25 -08:00
c64dec1993 Python binding to export bytecode format for lite interpreter (#32621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32621

Export the "_save_for_mobile" method to Python so that the bytecode format for the lite interpreter can be added to, or updated in, the original script model.

It's the first step of Python binding for the lite interpreter, as discussed in this [internal post](https://fb.workplace.com/groups/1144215345733672/permalink/1478900738931796/) and offline.

The next step is to export the load_for_mobile and run methods of the mobile module, so that users can verify the mobile model from Python.

Test: use the following python script to display the bytecode part of the updated model file.
```
#!/usr/bin/env python3
import sys
import pickle
import pprint
import zipfile

class FakeObject(object):
    def __init__(self, module, name, args):
        self.module = module
        self.name = name
        self.args = args
        self.state = None

    def __repr__(self):
        state_str = "" if self.state is None else f"(state={self.state!r})"
        return f"{self.module}.{self.name}{self.args!r}{state_str}"

    def __setstate__(self, state):
        self.state = state

class FakeClass(object):
    def __init__(self, module, name):
        self.module = module
        self.name = name
        self.__new__ = self.fake_new

    def __repr__(self):
        return f"{self.module}.{self.name}"

    def __call__(self, *args):
        return FakeObject(self.module, self.name, args)

    def fake_new(self, *args):
        return FakeObject(self.module, self.name, args)

class DumpUnpickler(pickle._Unpickler):
    def find_class(self, module, name):
        return FakeClass(module, name)

    def persistent_load(self, pid):
        return FakeObject("pers", "obj", (pid,))

def main(argv):
    zfile = zipfile.ZipFile(argv[1])
    names = [i for i in zfile.namelist() if "bytecode.pkl" in i]
    if not names:
        print("bytecode.pkl not found.")
        return
    with zfile.open(names[0], "r") as handle:
        value = DumpUnpickler(handle).load()
    pprint.pprint(value)

if __name__ == "__main__":
    sys.exit(main(sys.argv))

```

Test Plan: Imported from OSS

Differential Revision: D19596359

Pulled By: iseeyuan

fbshipit-source-id: 19a4a771320f95217f5b0f031c2c04db7b4079a8
2020-01-28 08:30:20 -08:00
e24ce0e524 Kill some more unused code in function_wrapper.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32600

Test Plan: Imported from OSS

Differential Revision: D19565654

Pulled By: gchanan

fbshipit-source-id: 993c3dc5467639a7690109d07911951a165a412f
2020-01-28 07:38:51 -08:00
9a2691f2fc Fix spelling errors
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32673

Differential Revision: D19597118

Pulled By: pietern

fbshipit-source-id: f88c1da7548fcee141ed248f5f49d25c1d639955
2020-01-28 04:46:15 -08:00
63170431f9 [jit] fix segfault on missing getstate (#32642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32642

Previously, if we defined `__setstate__` but not `__getstate__`, we
would segfault. This PR turns that into a comprehensible error message
(and improves another error message as well).

Fixes https://github.com/pytorch/pytorch/issues/25886
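
A minimal sketch of the failing pattern (the exact decoration needed for scripting is an assumption):

```python
import torch

class Stateful(torch.nn.Module):
    def forward(self, x):
        return x

    @torch.jit.export
    def __setstate__(self, state):
        # no matching __getstate__ is defined
        pass

m = torch.jit.script(Stateful())
torch.jit.save(m, "stateful.pt")  # used to segfault; now raises a clear error
```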

Test Plan: Imported from OSS

Differential Revision: D19596463

Pulled By: suo

fbshipit-source-id: dbe76bc36bc747d65fb0223184c009e0e9ba072c
2020-01-28 01:25:37 -08:00
8e4161517e div_kernel: throw when dividing by integer zero (#32629)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32629
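
A minimal sketch of the new behavior (assumption: the failure surfaces as a RuntimeError):

```python
import torch

a = torch.tensor([4, 2])
b = torch.tensor([2, 0])

try:
    a / b   # integer division by zero now throws instead of returning garbage
except RuntimeError as e:
    print("caught:", e)
```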

Differential Revision: D19595782

Pulled By: ezyang

fbshipit-source-id: f5bbb298f150efe63a698e8a0b53a84871d16560
2020-01-27 21:41:00 -08:00
b3848c568e Fix flaky test_nccl_timeout. (#32653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32653

This test was flaky since the watchdog thread could abort the
communicator instead of the thread calling `wait()`. As a result, we could
actually see `NCCL error` instead of `Operation timed out` on the user end.
ghstack-source-id: 97250714

Test Plan: waitforbuildbot

Differential Revision: D19583003

fbshipit-source-id: 5c07326d1a16f214dcdbabed97ca613e0a5b42b9
2020-01-27 21:09:40 -08:00
d68592a440 [JIT] Fix classes as attributes in recursive scripting
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32594

Test Plan: Imported from OSS

Differential Revision: D19562951

Pulled By: jamesr66a

fbshipit-source-id: 3d5491c1c23456f107390a78be16da687de951e6
2020-01-27 20:37:48 -08:00
b9f764b1c7 Use the C++ current RpcAgent pointer to eliminate the unnecessary argument passing from Python world (#32635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32635

With the source of truth for the current RPC agent moved to the C++ world, there is no point in passing the current RPC agent from the Python world to the C++ world.
ghstack-source-id: 97293316

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

Differential Revision: D5703519

fbshipit-source-id: ef7c28bdb1efd293eb6cafe0b0fca7ca80fa08a6
2020-01-27 20:24:32 -08:00
666e5430f8 Clean up mvlgamma doc (including a weird way to link to reference) (#32667)
Summary:
Intentionally left blank
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32667

Differential Revision: D19594683

Pulled By: ezyang

fbshipit-source-id: 5a6eb0a74f569d3c0db2a35e0ed4b329792a18e4
2020-01-27 20:12:17 -08:00
db8ce7ea2d Back out "Make autogen functions correct for multiple outputs and views" (#32681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32681

Original commit changeset: a2b41c2d231e

Test Plan: fb and oss tests

Reviewed By: hudeven

Differential Revision: D19591864

fbshipit-source-id: 7068b5563e37bc9a5d415fd535c73fd9d71fe131
2020-01-27 19:54:34 -08:00
5c8535d5b0 Make C++ RpcAgent::currentRPCAgent_ the source of truth of current RPC Agent (#32633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32633

There were 2 sources of current RPC agent.

- One is in the Python world, `torch.distributed.rpc.api._agent`.
- The other is in the C++ world, `RpcAgent::defaultRpcAgent_`

Setting the Python `_agent` to `None` does not necessarily reset the C++ `defaultRpcAgent_` to `nullptr`.

i.e.
```
 torch.distributed.rpc.api._agent = None
```
does not translate to
```
RpcAgent::defaultRpcAgent_ = nullptr
```

This PR removes this ambiguity and uses the C++ pointer as the source of truth.

The solution leverages a pybind11 behavior: it implicitly casts a C++ `shared_ptr<RpcAgent>(nullptr)` to Python `None`.
ghstack-source-id: 97293315

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_duplicate_name

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_remote_module

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_embedding

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling

buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

Differential Revision: D5733066

fbshipit-source-id: b3e6032ee975f19ca556497edbbf40b517b25be8
2020-01-27 19:34:12 -08:00
1217c9b364 Updating submodules
Summary:
GitHub commits:

3f156207e8
135cff30a5
7aa66c704f
1dc4136644
9166d9f767

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: fb27e09060ecb4278b4002c02bce48fe9f4dc361
2020-01-27 18:34:38 -08:00
1695915371 Make _wait_all_workers() support being called for multiple times (#32624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32624

We need this PR to resolve the issue mentioned in https://github.com/pytorch/pytorch/issues/31325#issuecomment-574918917.

The solution is to add a sequence ID to each `_wait_all_workers()` call, to identify the different calls.
ghstack-source-id: 97277591

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_wait_all_workers

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_wait_all_workers
```

Differential Revision: D5739520

fbshipit-source-id: a64131e09c365179624700514422f5375afe803f
2020-01-27 17:04:02 -08:00
39987de9e4 [vulkan][caffe2] Add logging for descriptor extensions, fp16 storage
Summary:
`fbcode/caffe2/caffe2/mobile/contrib/libvulkan-stub/BUCK` changes comment:

libvulkan-stub contains vulkan headers `VK_HEADER_VERSION 29`

fbandroid uses NDK r17, which includes Vulkan `VK_HEADER_VERSION 76` and contains the defines for the extensions that we need.

`("include", "**/*.h"),` -> `("include", "*.h"),` means the NDK Vulkan headers are used.

For fp16_storage logging we need to add boilerplate for `vkGetPhysicalDeviceFeatures2KHR`

Test Plan:
scuba employees device_event

logcat getVulkanInfo().
```
instance ext.name:VK_KHR_surface
instance ext.name:VK_KHR_android_surface
instance ext.name:VK_EXT_swapchain_colorspace
instance ext.name:VK_KHR_get_surface_capabilities2
instance ext.name:VK_EXT_debug_report
instance ext.name:VK_KHR_device_group_creation
instance ext.name:VK_KHR_external_fence_capabilities
instance ext.name:VK_KHR_external_memory_capabilities
instance ext.name:VK_KHR_get_physical_device_properties2
instance ext.name:VK_KHR_external_semaphore_capabilities
device ext.name:VK_KHR_incremental_present
device ext.name:VK_EXT_hdr_metadata
device ext.name:VK_KHR_shared_presentable_image
device ext.name:VK_GOOGLE_display_timing
device ext.name:VK_KHR_push_descriptor
device ext.name:VK_KHR_image_format_list
device ext.name:VK_EXT_queue_family_foreign
device ext.name:VK_ANDROID_external_memory_android_hardware_buffer
device ext.name:VK_KHR_external_semaphore_fd
device ext.name:VK_KHR_external_fence_fd
device ext.name:VK_KHR_external_memory_fd
device ext.name:VK_KHR_external_memory
device ext.name:VK_KHR_swapchain
device ext.name:VK_KHR_external_semaphore
device ext.name:VK_KHR_driver_properties
device ext.name:VK_KHR_sampler_mirror_clamp_to_edge
device ext.name:VK_KHR_multiview
device ext.name:VK_KHR_relaxed_block_layout
device ext.name:VK_KHR_maintenance1
device ext.name:VK_KHR_maintenance3
device ext.name:VK_KHR_maintenance2
device ext.name:VK_EXT_global_priority
device ext.name:VK_KHR_get_memory_requirements2
device ext.name:VK_KHR_descriptor_update_template
device ext.name:VK_KHR_bind_memory2
device ext.name:VK_KHR_shader_draw_parameters
device ext.name:VK_KHR_dedicated_allocation
device ext.name:VK_KHR_create_renderpass2
device ext.name:VK_KHR_draw_indirect_count
device ext.name:VK_KHR_sampler_ycbcr_conversion
device ext.name:VK_KHR_device_group
device ext.name:VK_KHR_external_fence
device ext.name:VK_KHR_variable_pointers
device ext.name:VK_EXT_sampler_filter_minmax
device ext.name:VK_KHR_storage_buffer_storage_class
VULKAN_SYMBOL_WRAPPER_LOAD_INSTANCE_SYMBOL(vkGetPhysicalDeviceFeatures2KHR) res=1
mChipsetInfoUtilInfo.getVulkanInfo():{vk_driver_version=2149056512, vk_device_id=100859905, vk_extension_descriptor_update_template=1, vk_api_version=4198487, vk_support_fp16_storage=0, vk_platform_dlopen=success, vk_shader_int16=1, vk_device_type=1, vk_shader_float64=0, vk_extension_push_descriptor=1, vk_shader_int64=0, vk_wrapper_init=true, vk_vendor_id=20803, vk_max_compute_shared_memory_size=32768, vk_device_name=Adreno (TM) 630, vk_max_compute_work_group_invocations=1024, vk_device_count=1}
```

Reviewed By: dreiss

Differential Revision: D19564664

fbshipit-source-id: 908b34bdcc24d9b03ecc185edbc5cfb6e7aa27c9
2020-01-27 16:34:47 -08:00
812b1ad869 [quantization] FP16 dynamic quantized Linear
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32331

Test Plan: Imported from OSS

Differential Revision: D19441158

Pulled By: jamesr66a

fbshipit-source-id: c04247ffe707be68718c486c31bc6c6040f7dc11
2020-01-27 15:45:32 -08:00
389b9c180b Updating submodules
Summary:
GitHub commits:

9ae8cbb0a1
986df37135
ef4d11b6e1

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 04e7a5ad02cb412ef36672ec30e10a898c525232
2020-01-27 14:43:34 -08:00
57519bd829 Revert "Fix iterator for ncclCommWatchdog. (#32571)" (#32649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32649

This reverts commit 59dbece3716775c3e6f3a428f73fbf1bde8fac4f.

Revert "Enhance NCCL watchdog to acitvely abort communicators for timed out ops. (#32338)"

This reverts commit f86d6c6afd0e981642d20b4269837334ec46c140.

Test Plan: Imported from OSS

Differential Revision: D19584224

Pulled By: ezyang

fbshipit-source-id: 6cc0ad56ba1f3aec5b48db44e8c6c24c8105db4a
2020-01-27 14:25:30 -08:00
897b6908d4 Kill THIntegerTensor, THDenseTensor, THDenseIndexTensor. (#32599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32599

These aren't used anymore.

Test Plan: Imported from OSS

Differential Revision: D19565655

Pulled By: gchanan

fbshipit-source-id: c0da31365df7342352f9850ae2a2e1e611a6886b
2020-01-27 13:26:31 -08:00
f6c46df856 Adding native qconcat
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32252

Test Plan: Imported from OSS

Differential Revision: D19422889

Pulled By: z-a-f

fbshipit-source-id: 23dd5f50009cc4c46b36c39ae1168b57f9a977a4
2020-01-27 11:24:46 -08:00
f0917dce7f Revert D19562258: [pytorch][PR] Fixes moving after weight norm application
Test Plan: revert-hammer

Differential Revision:
D19562258

Original commit changeset: 4fef006e32cd

fbshipit-source-id: 62e40de19331a61f4a65b7371460fe7dc28f23ea
2020-01-27 11:18:19 -08:00
64323ae177 Back out "Use simd version for fp16 conversions" (#32640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32640

Original commit changeset: 3b1ee0ba756e

Reverting according to https://our.intern.facebook.com/intern/diff/D19291499/?transaction_id=1347995678706116&dest_fbid=465672071047258

Test Plan: unittests.

Reviewed By: jspark1105, jianyuh

Differential Revision: D19576708

fbshipit-source-id: bec92318523498067935234ab702c925ece71da6
2020-01-27 10:01:24 -08:00
e36cbb8f2f Fixes moving after weight norm application (#32563)
Summary:
This PR updates how RNNs handle their "flat weights." In particular, it allows only some flat weights to be "materialized" when apply is called, and it updates the flattening behavior to apply only if all flat weights (1) are materialized, (2) share a dtype, and (3) are acceptable to cuDNN.

One test is modified and another created to test these changes. One practical effect of this change is that weight norm can be successfully applied to a module BEFORE that module is moved to an accelerator. Previously, doing so would throw an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32563
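
For illustration (not part of the PR), a minimal sketch of the newly supported order of operations:

```python
import torch
from torch.nn.utils import weight_norm

lin = weight_norm(torch.nn.Linear(4, 4))   # apply weight norm first...
device = "cuda" if torch.cuda.is_available() else "cpu"
lin = lin.to(device)                       # ...then move the module; this used to error
out = lin(torch.randn(2, 4, device=device))
print(out.shape)
```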

Differential Revision: D19562258

Pulled By: mruberry

fbshipit-source-id: 4fef006e32cdfd8e3e3d519fc2ab5fc203dd7b36
2020-01-27 09:57:43 -08:00
5ac2593d4f [ROCm] Adjust elementwise_kernel settings on ROCm (#32609)
Summary:
Recent PR https://github.com/pytorch/pytorch/issues/31974 and upcoming PR https://github.com/pytorch/pytorch/issues/32383 change the behavior of the elementwise_kernel infrastructure on CUDA.

To stay in sync, this changes the nd-loop behavior on ROCm to match CUDA for now. Once the full rework is done, the ROCm settings will likely diverge again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32609

Differential Revision: D19580121

Pulled By: ezyang

fbshipit-source-id: 4c8dcf6db3ac973e48ece6a665615cfe7d7cb764
2020-01-27 09:26:28 -08:00
ca9dc67094 0-dim batch size input for interpolate. (#32400)
Summary:
This PR adds support for 0-dim batch size input for `torch.nn.functional.interpolate` for various modes of interpolation.

Fixes part of gh-12013

CC: rgommers  ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32400
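
A minimal sketch (not from the PR) of the newly supported zero-sized batch:

```python
import torch
import torch.nn.functional as F

x = torch.empty(0, 3, 8, 8)                        # batch size 0
y = F.interpolate(x, scale_factor=2, mode="nearest")
print(y.shape)                                     # torch.Size([0, 3, 16, 16])
```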

Differential Revision: D19557090

Pulled By: ezyang

fbshipit-source-id: 6822f148bb47bfbcacb5e03798bf2744f24a2a32
2020-01-27 09:24:46 -08:00
602394e996 verify input sizes for instance norm and group norm (#29082)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29082
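
A minimal sketch of the kind of input this change rejects (an assumption based on the linked issue; the exact exception type may differ):

```python
import torch

m = torch.nn.InstanceNorm1d(3)
x = torch.randn(1, 3, 1)   # one value per channel: nothing to normalize over

try:
    m(x)                   # previously produced garbage; now rejected up front
except (ValueError, RuntimeError) as e:
    print("rejected:", e)
```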

Differential Revision: D19373507

Pulled By: ezyang

fbshipit-source-id: 231a79280f4cd7db2c26218a60869356a124bf77
2020-01-27 09:05:56 -08:00
19bb496a0d Enable mkldnn on windows (#31355)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/15982.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31355

Differential Revision: D19428979

Pulled By: ezyang

fbshipit-source-id: bee304c5913e70e8dead3098e9796051861cd666
2020-01-27 09:00:02 -08:00
957a07ffbd [ROCm] Enable Caffe2 video operators for ROCm
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32610

Differential Revision: D19580129

Pulled By: ezyang

fbshipit-source-id: 16d620173dcc231068e041d599aa09c94e677a9e
2020-01-27 08:29:07 -08:00
5b321a0985 [rpc] make handling of FORWARD_AUTOGRAD_REQ in request_callback_impl (#32476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32476

This makes the handling of FORWARD_AUTOGRAD_REQ in request_callback
nonblocking. Processing this message requires unwrapping the message with
autograd information, processing the original message, and sending back the
message with autograd information wrapped. This change makes processing of
the original message nonblocking by grabbing a future for it and marking the
parent future as completed when this one completes.
ghstack-source-id: 97221251

Test Plan: `test_rpc_spawn.py` and `test_dist_autograd_spawn.py` both pass.

Differential Revision: D19509501

fbshipit-source-id: 84ad2f9c5305ed11ed9bb0144b1aaf5f8698cd2b
2020-01-27 00:47:27 -08:00
1e5aead35b Make cuda search process of cpp extension quiet (#32620)
Summary:
Fixes https://discuss.pytorch.org/t/error-with-cpp-extentions/67559.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32620

Differential Revision: D19576164

Pulled By: soumith

fbshipit-source-id: 076229322375774bec03ef2632fc233000c15391
2020-01-26 20:26:43 -08:00
8fbe1ccd16 faster bailout tests (#32266)
Summary:
Reduces the overhead of `prim::BailOut` nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32266

Differential Revision: D19503336

Pulled By: Krovatkin

fbshipit-source-id: daa0c373f0fa17edd689600b75e7e4ba98b4670a
2020-01-26 19:44:00 -08:00
12d5933969 Bug fix of norm minimization for dev mode (#31462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31462

Fix the divide by zero issue in norm minimization in dev mode

Test Plan: buck run mode/dev vision/video_modeling/classification/tools:test_octGloRe_quantization -- --test_data=/mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/xray_video/deep_vision_video_yufei_test_data_fcc_v4p2_10.csv --output_dir /mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/xray_video/octGloRe --load_model_path=/mnt/vol/gfsfblearner-oregon/flow/data/2019-10-15/e2681db8-e4f5-4b70-ae18-45bf0b8fbfbc/train_model_epoch0_inputcount0_final.mdl --dataset_name="FCC V4P2" --num_labels=1099 --column_handle="handle" --clip_per_video=1 --num_groups=24 --width_per_group=2 --batch_size=32 --histogram_file=/mnt/vol/gfsadslearner-frc3c01/fblearner_flow/users/summerdeng/xray_video/octGloRe/hist_octGloRe_final_24x2_fcc_v4p2_1clip_f144586257_nullfix_100k_compiled.hist --int8_model_type="pb"  --int8_predict_net_path="reproduce_octGloRe_final_24x2_predict_net_int8_l2approx_wminmax_from_mdl.pb" --int8_init_net_path="reproduce_octGloRe_final_24x2_init_net_int8_l2approx_wminmax_from_mdl.pb" --weight_quant="l2_approx" --activation_quant="l2_approx"  --print_model --int8_model_saved --num_iter 10

Reviewed By: jspark1105

Differential Revision: D19172591

fbshipit-source-id: 994a20e3364b0dc33623a11281e0bdbc2e06159d
2020-01-26 12:44:14 -08:00
90a259e1e2 Add warning regarding pickle insecurity on torch.load documentation (#32593)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31875

Added a small warning box based on the one presented in the [pickle](https://docs.python.org/3/library/pickle.html) module documentation regarding the safety issues of unpickling files, i.e., unwanted code execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32593
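
A short sketch of the usage the warning targets; `"checkpoint.pt"` is a placeholder path:

```python
import torch

# torch.load unpickles its input, and unpickling can execute arbitrary
# code, so only call it on files from sources you trust.
state = torch.load("checkpoint.pt", map_location="cpu")
```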

Differential Revision: D19572292

Pulled By: ngimel

fbshipit-source-id: 69e7de390133ea77bddcadcd5b6820193c8abcc9
2020-01-25 22:12:37 -08:00
3bbb36e02d Update linspace types (#32218)
Summary:
Changes the linspace functions to be more consistent, as requested in https://github.com/pytorch/pytorch/issues/31991. The code has also been updated to avoid an early rounding error: the line `scalar_t step = (scalar_end - scalar_start) / static_cast<scalar_t>(steps-1)` can result in `step = 0` for integer scalars, which gives unintended results. I examined the new output using
```
import torch

types = [torch.uint8, torch.int8, torch.short, torch.int, torch.long, torch.half, torch.float, torch.double]

print('Testing linspace:')
for type in types:
    print(type, torch.linspace(-2, 2, 10, dtype=type))
```
which returns
```
Testing linspace:
torch.uint8 tensor([254, 254, 254, 255, 255,   0,   0,   1,   1,   2], dtype=torch.uint8)
torch.int8 tensor([-2, -2, -2, -1, -1,  0,  0,  1,  1,  2], dtype=torch.int8)
torch.int16 tensor([-2, -2, -2, -1, -1,  0,  0,  1,  1,  2], dtype=torch.int16)
torch.int32 tensor([-2, -2, -2, -1, -1,  0,  0,  1,  1,  2], dtype=torch.int32)
torch.int64 tensor([-2, -2, -2, -1, -1,  0,  0,  1,  1,  2])
torch.float16 tensor([-2.0000, -1.5557, -1.1113, -0.6670, -0.2227,  0.2227,  0.6660,  1.1113,
         1.5547,  2.0000], dtype=torch.float16)
torch.float32 tensor([-2.0000, -1.5556, -1.1111, -0.6667, -0.2222,  0.2222,  0.6667,  1.1111,
         1.5556,  2.0000])
torch.float64 tensor([-2.0000, -1.5556, -1.1111, -0.6667, -0.2222,  0.2222,  0.6667,  1.1111,
         1.5556,  2.0000], dtype=torch.float64)
```
which is the expected output: `uint8` overflows as it should, and the result of casting from a floating point to an integer is correct.

This PR does not change the logspace function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32218

Differential Revision: D19544224

Pulled By: ngimel

fbshipit-source-id: 2bbf2b8552900eaef2dcc41b6464fc39bec22e0b
2020-01-25 20:23:54 -08:00
5fd037ce44 Fix MagmaInitializesCorrectly_CUDA by using an invertible matrix (#32547)
Summary:
This test case had been using the tensor

```
1  2  3  4
5  6  7  8
9  10 11 12
13 14 15 16
```

which is not invertible and causes the test case to fail even if magma is initialized just fine. This change uses a tensor that is invertible, whose inverse doesn't include any elements close to zero, to avoid floating-point rounding errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32547

Differential Revision: D19572316

Pulled By: ngimel

fbshipit-source-id: 1baf3f8601b2ba69fdd6678d7a3d86772d01edbe
2020-01-25 20:00:54 -08:00
320d1a1573 Fix wrong typing (torch/nn/parameter.pyi) (#32617)
Summary:
The constructor of `nn.Parameter` has default values for `data` and `requires_grad`, but in the type stub there are no default values.

Resolve https://github.com/pytorch/pytorch/issues/32481
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32617
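
A minimal sketch of the defaults the stub now reflects:

```python
import torch

p = torch.nn.Parameter()                # `data` defaults to an empty tensor
q = torch.nn.Parameter(torch.randn(3))  # `requires_grad` defaults to True
print(p.shape, q.requires_grad)
```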

Differential Revision: D19571397

Pulled By: ngimel

fbshipit-source-id: fd14298aa472b7575221229cecf5a56f8c84f531
2020-01-25 16:19:33 -08:00
69283388ca [pytorch] codegen flags to whitelist op registrations / generate to separate files (#32451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32451

This PR adds a few new parameters to ATen codegen script:

```
1. op_registration_whitelist
Can be used to filter op registrations for selective build;

2. type_whitelist
Can be used to filter types (CPUType, CUDAType, ...) for selective build;

3. per_op_registration
When set it will group function registrations by op name and write to separate files;
```

1 & 2 are introduced for mobile custom build without relying on static dispatch;
3 is introduced to solve custom build with multi-library / multi-model (needed by FB
internal build - see more details: https://fb.quip.com/ZVh1AgOKW8Vv).

These flags should work independently of each other (and independently of USE_STATIC_DISPATCH).

Not setting them should have no effect compared to master.
ghstack-source-id: 97214788

Test Plan: - tested all 3 params with FB internal build changes.

Differential Revision: D19427919

fbshipit-source-id: a381fe5f768fe2e9196563787f08eb9f18316e83
2020-01-25 15:27:29 -08:00
0afe195046 [pytorch] move type_derived_methods out of anonymous namespace (#32275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32275

Currently, TypeDerived (e.g. `CPUType::`) methods are declared and
defined in an anonymous namespace, as they are only called from the c10
dispatcher - except in STATIC_DISPATCH mode, where they can be called
directly from Functions.h.

We plan to generate c10 op registration into separate files for the internal
xplat/BUCK build, thus we need to declare these methods in a non-anonymous
namespace.

I feel it's easier to simply change this unconditionally, unless there are
side effects I'm not aware of - `TypeDefault::` methods are in a
non-anonymous namespace anyway.
ghstack-source-id: 97214789

Test Plan: - CI

Differential Revision: D19426692

Pulled By: ljk53

fbshipit-source-id: 44aebba15f5e88ef4acfb623844f61d735016959
2020-01-25 15:24:32 -08:00
bd20274e8f [caffe2] use JIT'ed fp32 SLS (#32413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32413

Use JIT'ed fp32 SLS in Caffe2 operators

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D19460555

fbshipit-source-id: 4f29d34523efb6ea1e4c324cc8c93c96990c6aad
2020-01-25 12:57:18 -08:00
6ad9e5c70d Support TorchScript call over remote API (RRef) (#32466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32466

It's a follow-up to https://github.com/pytorch/pytorch/pull/32197.

In https://github.com/pytorch/pytorch/pull/32197, `rpc.rpc_sync(..)` and `rpc.rpc_async(..)` gained support for taking a TorchScript-annotated Python function as the user function for RPC.

This PR extends along this direction by making `rpc.remote(..)` support taking a TorchScript-annotated Python function as well.
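
A minimal sketch (not from the PR) of the new capability; it assumes `rpc.init_rpc(...)` has already run on each worker, and `"worker1"` is an illustrative peer name:

```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def add(a, b):
    # type: (Tensor, Tensor) -> Tensor
    return a + b

rref = rpc.remote("worker1", add, args=(torch.ones(2), torch.ones(2)))
print(rref.to_here())   # tensor([2., 2.])
```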

ghstack-source-id: 97211168

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_function_exception

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_function_exception
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork -- test_backward_simple_script_call

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D19440633

fbshipit-source-id: d37f6dcdc0b80d35ac7bcba46ad6f9b831c3779b
2020-01-25 02:18:27 -08:00
e0ffe72649 [aten] fix shadowing variable warning (#32573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32573

Fix the following warning
```
caffe2/aten/src/ATen/ParallelOpenMP.h:36:9: warning: declaration of ‘num_threads’ shadows a previous local [-Wshadow=compatible-local]
     int64_t num_threads = omp_get_num_threads();
         ^~~~~~~~~~~
caffe2/aten/src/ATen/ParallelOpenMP.h:29:9: note: shadowed declaration is here
   int64_t num_threads = omp_in_parallel() ? 1 : omp_get_max_threads();
         ^~~~~~~~~~~
```

Test Plan: CI

Reviewed By: ilia-cher

Differential Revision: D19552578

fbshipit-source-id: b8388de1aaa2bb7676b777c93b8ba9c25f5a3d51
2020-01-24 18:48:07 -08:00
169541871a Add operator support for dynamic quant on mobile (#32479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32479

Run dynamic quantization on mobile (similar to FBGEMM). Currently only implemented for the linear operator.
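
For context, a sketch of the server-side dynamic quantization API whose operator support this adds on mobile:

```python
import torch

float_model = torch.nn.Sequential(torch.nn.Linear(8, 8))
# Quantize only Linear modules dynamically (weights qint8, activations fp32).
qmodel = torch.quantization.quantize_dynamic(
    float_model, {torch.nn.Linear}, dtype=torch.qint8)
print(qmodel(torch.randn(1, 8)).shape)
```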

Test Plan:
python test/test_quantized.py TestDynamicQuantizedLinear.test_qlinear

Imported from OSS

Differential Revision: D19542980

fbshipit-source-id: c9f6e5e8ded4d62ae0f2ed99e478c8307dde22ed
2020-01-24 17:51:54 -08:00
59dbece371 Fix iterator for ncclCommWatchdog. (#32571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32571

The watchdog thread would erase an element and call `it--` (implicitly
relying on the `it++` in the for loop to reposition correctly). However, `it--`
causes undefined behavior if the iterator points to begin(). As a result,
I've modified the logic to update the iterator appropriately.

I've also enhanced the watchdog thread to catch and log exceptions.
ghstack-source-id: 97150763

Test Plan: waitforbuildbot

Differential Revision: D19551365

fbshipit-source-id: 426835819ad8d467bccf5846b04d14442a342f78
2020-01-24 17:34:36 -08:00
1218a16aae [pytorch][refactor] Explicitly use auto* for pointers (#32548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32548

As Title says.
ghstack-source-id: 97175523

Test Plan: CI

Differential Revision: D19541893

fbshipit-source-id: 96dce6964e6a89393d4159401a59672f041f51d3
2020-01-24 17:20:38 -08:00
e7edc5f20e [jit] Cloning constants in ClassType (#32371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32371

After constants were added to ClassType, `clone` was not updated to clone them; this PR adds that support.
Fixes: https://github.com/pytorch/pytorch/issues/32368

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D19564378

fbshipit-source-id: dbb13fb889d6ea9291034313b1f3c9aff4748bda
2020-01-24 16:48:38 -08:00
666472a38d [docs] Change fut.wait() to torch.jit._wait(fut) in jit overview docs (#32336)
Summary:
It looks like the JIT Future no longer has a `wait()`, and this throws an error when trying to run this code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32336
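
A minimal sketch of the corrected pattern inside scripted code:

```python
import torch

@torch.jit.script
def double(x):
    # type: (Tensor) -> Tensor
    return x * 2

@torch.jit.script
def run(x):
    # type: (Tensor) -> Tensor
    fut = torch.jit._fork(double, x)   # schedule asynchronously
    return torch.jit._wait(fut)        # replaces the removed fut.wait()

print(run(torch.ones(3)))
```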

Differential Revision: D19559922

Pulled By: rohan-varma

fbshipit-source-id: a5aa67990595e98e0682a20cf5aced17c2ae85bb
2020-01-24 16:40:22 -08:00
6412ca3ce9 duplicate symbols with AT_PARALLEL_OPENMP=0 (#32568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32568

Explicitly disabling OpenMP actually causes it to be used.

Test Plan: CI passes

Reviewed By: ilia-cher

Differential Revision: D19549732

fbshipit-source-id: 767b92148f47a1450ded46e101cd3d9b331a5d40
2020-01-24 16:27:50 -08:00
91f10a1de1 [quant][graphmode][refactor] Better API for fold_convbn (#32380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32380

We'll clone the module first, then fold conv-bn, and return a new module.

Test Plan:
.

Imported from OSS

Differential Revision: D19508033

fbshipit-source-id: 328e91a2c9420761c904a7f2b62dab4cfaaa31ac
2020-01-24 15:46:47 -08:00
52f8f031ac add diag into pt operator microbenchmark (#32597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32597

Currently, there is no benchmark test for the diag operator. This diff adds one to the suite.

Test Plan:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: diag
# Mode: Eager
# Name: diag_dim1_M64_N64_diagonal0_outTrue_cpu
# Input: dim: 1, M: 64, N: 64, diagonal: 0, out: True, device: cpu
Forward Execution Time (us) : 28.496

# Benchmarking PyTorch: diag
# Mode: Eager
# Name: diag_dim2_M128_N128_diagonal-10_outFalse_cpu
# Input: dim: 2, M: 128, N: 128, diagonal: -10, out: False, device: cpu
Forward Execution Time (us) : 45.179

# Benchmarking PyTorch: diag
# Mode: Eager
# Name: diag_dim1_M256_N256_diagonal20_outTrue_cpu
# Input: dim: 1, M: 256, N: 256, diagonal: 20, out: True, device: cpu
Forward Execution Time (us) : 49.009
```

Reviewed By: mingzhe09088

Differential Revision: D19564024

fbshipit-source-id: 828a3e0e0e06810a77eb5ddb734efd30e4a63acf
2020-01-24 15:41:04 -08:00
9e0ce72e9e [pytorch] change op dependency output to use double-quoted strings (#32464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32464

Changed to double-quoted strings to make the FB linter happy.

Test Plan: Imported from OSS

Differential Revision: D19507859

Pulled By: ljk53

fbshipit-source-id: fa70535c7fbea73214b3b0efb0532184b5ee6854
2020-01-24 15:27:28 -08:00
2bfd33b4ab [refactor] Adding FoldConvBatchNorm2dHelper (#32374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32374

Moving all fold conv-bn code into a class, to prepare for making it work with shared ClassTypes.

Test Plan:
compiles

Imported from OSS

Differential Revision: D19508032

fbshipit-source-id: 4e9cf714111305d2b5474d4506507078f69f0c84
2020-01-24 14:41:20 -08:00
573a30270c [pytorch] Minor: boilerplate to propagate errors in request_callback_impl (#32556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32556

Out of caution, avoid assuming that there's never a failure in a couple of
request_callback_impl case handlers; rather, propagate the error.
ghstack-source-id: 97128697

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D19544685

fbshipit-source-id: 67c55626960bd42a5b0dec7841e8ba44ab059eb9
2020-01-24 14:37:33 -08:00
3ab30753e9 Make autogen functions correct for multiple outputs and views (#31990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31990

This PR does three things:
- Add a new `allow_rebase_history` flag to the differentiable views. If set, trying to rebase their history will raise an error.
- Make sure that the codegen functions verify this flag before doing inplace operations so that they fail before doing the inplace modification.
- Make sure the codegen functions set this flag properly when we don't support rebasing the history of the output.

The codegen change can be found [here](4bf180caa0).

Test Plan: Imported from OSS

Differential Revision: D19409649

Pulled By: albanD

fbshipit-source-id: a2b41c2d231e952ecfe162bdb6bad620ac595703
2020-01-24 14:32:28 -08:00
9e59244b53 fix view listing in autograd codegen (#32044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32044

Fix the list of views in the codegen:
- Move `narrow` out of the autograd functions since it's now implemented with slice.
- Add `split_with_sizes` that was missing from the list
- Remove the special formulas for both `split` and `split_with_sizes`. Neither used to be considered a view; when they are, all the RNN code breaks because it uses them in an invalid way. The generic formula generates one `narrow` Node for each output, which is always valid.

The diff for the generated code can be found [here](https://github.com/pytorch/pytorch/compare/16eff6e...albanD:06d6e85) (outdated for last commit)

Test Plan: Imported from OSS

Differential Revision: D19409648

Pulled By: albanD

fbshipit-source-id: 5ebc4c978af500403f7f008c0231b7db0cabab26
2020-01-24 14:31:21 -08:00
d2bda53f6d [quant][graphmode] Call _jit_pass_dedup_module_ueses in quantize_script (#32303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32303

att

Test Plan:
.

Imported from OSS

Differential Revision: D19508029

fbshipit-source-id: 468ed53fc8bb3c8fdf5d79aea186949e64be711a
2020-01-24 13:34:40 -08:00
fe3eb09da5 [quant] Re-enable fold_convbn in quantize_script (#32302)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32302

att

Test Plan:
.

Imported from OSS

Differential Revision: D19508035

fbshipit-source-id: 2ac26585396ec8a115acd0e1d7ccb84098a76824
2020-01-24 13:03:53 -08:00
fd1a4f18ee [pytorch] update code analyzer build.sh to handle srcs with same name (#32525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32525

Before calling the static code analyzer we need to link all bitcode files into
a single module. The current approach is a bit hacky: cmake still calls "ar"
to pack bitcode files into archives, then we manually unpack those
archives and call llvm-link.

Turns out libtorch_cpu.a contains a few files with the same name, e.g.:
```
aten/src/ATen/native/SoftMax.cpp
aten/src/ATen/native/mkldnn/SoftMax.cpp
```

"ar x" will only keep one of them, causing inaccurate analysis results.

Use this temporary hack to work around the problem. Ideally this step should
be merged into cmake (e.g. by directly calling llvm-link to produce the
target output).

Differential Revision: D19530533

Pulled By: ljk53

fbshipit-source-id: 94b292c241abaaa0ff4a23059882abdc3522971e
2020-01-24 12:37:30 -08:00
ef5637f85e [jit] allow compilation using optional modules (#32539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32539

Before: if something in `_modules` was `None`, we would barf. This is
incorrect because users are allowed to put `None` there, in case a
module is optional.

This case ought to be handled correctly during scripting. Fixes https://github.com/pytorch/pytorch/issues/32469

Test Plan: Imported from OSS

Differential Revision: D19552346

Pulled By: suo

fbshipit-source-id: aba7fdc19fd84d195c81cdaca8a75013a8626a8b
2020-01-24 11:51:47 -08:00
7d0f0b62de API for testing bailouts (#32518)
Summary:
This API seems quite useful for making sure all bailouts in a graph are triggered. I used it for testing torchvision models, and I was wondering if this might be something we actually want to have? zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32518

Differential Revision: D19553147

Pulled By: Krovatkin

fbshipit-source-id: 7542c99051588b622091aec6d041c70731ca5d26
2020-01-24 11:19:41 -08:00
f0c85571ed docker: Refactor Dockerfile process for official images (#32515)
Summary:
## Commit Message:

Refactors Dockerfile to be as parallel as possible with caching and adds a new Makefile to build said Dockerfile.

Also updated the README.md to reflect the changes, and updated some of the verbiage around running our latest Docker images.

Adds the new Dockerfile process to our CircleCI workflows

## How to build:

Building the new images is pretty simple; it just requires `docker` > 18.06, since the new build process relies on `buildkit` caching and multi-stage build resolving.

### Development images
For `runtime` images:
```
make -f docker.Makefile runtime-image
```

For `devel` images:
```
make -f docker.Makefile devel-image
```

Builds are tagged as follows:
```bash
docker.io/${docker_user:-whoami}/pytorch:$(git describe --tags)-${image_type}
```

Example:
```
docker.io/seemethere/pytorch:v1.4.0a0-2225-g9eba97b61d-runtime
```

### Official images

Official images are the ones hosted on [`docker.io/pytorch/pytorch`](https://hub.docker.com/r/pytorch/pytorch)

To do official image builds you can simply set the `BUILD_TYPE` variable to `official` and it will do the correct build without building the local binaries:

Example:
```
make -f docker.Makefile BUILD_TYPE=official runtime-image
```

## How to push:

Pushing is also super simple (And will automatically tag the right thing based off of the git tag):

```
make -f docker.Makefile runtime-push
make -f docker.Makefile devel-push
```
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32515

Differential Revision: D19558619

Pulled By: seemethere

fbshipit-source-id: a06b25cd39ae9890751a60f8f36739ad6ab9ac99
2020-01-24 10:27:20 -08:00
8fd3eaed25 [jit] Fix dict type serialization (#32569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32569

If the dict's contained types cannot be inferred from its contents (for
example, `Dict[str, Tensor]` vs. `Dict[str, Optional[Tensor]]`), we must
explicitly annotate the type.

This also removes some special handling that omits annotations on empty
containers that have the default type. It made the code more complex
for not much value, and was wrong for dicts anyway.
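
A minimal sketch (not from the PR) of a dict whose contained type cannot be inferred from its contents:

```python
import torch
from typing import Dict, Optional

@torch.jit.script
def make_dict() -> Dict[str, Optional[torch.Tensor]]:
    # An empty literal doesn't pin down the value type, so it must be
    # annotated -- the same information the serializer now emits.
    d = torch.jit.annotate(Dict[str, Optional[torch.Tensor]], {})
    d["maybe"] = None
    return d

print(make_dict())
```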

Test Plan: Imported from OSS

Differential Revision: D19551016

Pulled By: suo

fbshipit-source-id: c529b112e72c10f509a6bc0f5876644caa1be967
2020-01-24 03:19:55 -08:00
3ada2e0d64 [pytorch][embeddingbag] Parallelize the EmbeddingBag operator (#4049)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477

We would like to add intra-op parallelization support for the EmbeddingBag operator.

This should bring speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385

Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals

import torch
import time

eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')

input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)

niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```

The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133

- After our change using the EmbeddingLookupIdx API which takes the offsets instead of lengths.
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659

- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018

For multi-core performance, because Clang doesn't work with OMP, I can only see the single-core performance on SKL T6.
ghstack-source-id: 97124557

Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```

With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```

OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```

Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"

OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets"  --print-passing-details
```

Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```

Differential Revision: D17768404

fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
2020-01-23 21:29:44 -08:00
b474c351dd [rpc] Remove template on RRef and add Type to RRef creation (#30630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30630

This removes the template and all the specializations it had in rpc; we
universally use IValue as the inner value, since we support holding Python
objects inside IValue.

This also ensures that we have the correct type information when creating
the RRef: we use the return type from the schema when creating userRRef and
OwnerRRef, which enables the IValue to always carry the correct type when it
holds an RRef object (next PR)

Test Plan: Imported from OSS

Differential Revision: D19502235

fbshipit-source-id: 0d5decae8a9767e0893f3b8b6456b231653be3c5
2020-01-23 21:15:46 -08:00
ef2d4e67d1 Updating submodules
Summary:
GitHub commits:

08e28edc08
6884ecfc67
685144514f
ed665880aa

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 7b19dca06ad7e8751de21efc48f5eada37b446fb
2020-01-23 21:09:43 -08:00
6f146e1768 [JIT] Remove capsule type handling of node hashing (#32540)
Summary:
Capsule Type doesn't appear in the IR; it is purely used in the runtime. So we should not have to handle it in node hashing... Let's see if this breaks anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32540

Differential Revision: D19541357

Pulled By: eellison

fbshipit-source-id: 905ed9f89cf6d03b45ddb4fde02adfa149b477f8
2020-01-23 17:44:28 -08:00
d2f66083c5 porting gather to ATen using TensorIterator with multithreading support. (#32425)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24702](https://github.com/pytorch/pytorch/issues/24702).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32425

Differential Revision: D19538265

Pulled By: ngimel

fbshipit-source-id: 78821a16b6948916e956a04f984e0956f86cf582
2020-01-23 16:14:47 -08:00
4cd6b5cda6 [quant] Re-enable test_nested that has different qconfig for shared ClassType (#32206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32206

att

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D19508028

fbshipit-source-id: 5de3c2ef17de146feca03d7135a7e04f393de398
2020-01-23 15:32:57 -08:00
6745bfc31c Revert "Remove __torch__ from custom class qualname" (#32514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32514

This reverts commit c7fdf5b251c6fecd5d78b4f33d30bd77ca3f841c.

Test Plan: Imported from OSS

Differential Revision: D19525532

Pulled By: jamesr66a

fbshipit-source-id: 126f4e87250a2ac739bd7aa161a0f7b39f143d38
2020-01-23 14:56:25 -08:00
8ed1dd528e [JIT] Add torch.classes.load_library
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32508

Test Plan: Imported from OSS

Differential Revision: D19525175

Pulled By: jamesr66a

fbshipit-source-id: b9f07113f551bdfb56d49d24d12989be2b8fc7e4
2020-01-23 14:56:20 -08:00
69f9bf8893 [JIT] Support returning tuple from custom bound C++ method
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32477

Test Plan: Imported from OSS

Differential Revision: D19509927

Pulled By: jamesr66a

fbshipit-source-id: 7d407150402cc19344c3ec3b4a27b3d7c464e8ac
2020-01-23 14:56:15 -08:00
ae42e232ce [JIT] Fix custom class method binding for const methods
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32471

Test Plan: Imported from OSS

Differential Revision: D19508249

Pulled By: jamesr66a

fbshipit-source-id: 3a0bce6845072bb03567049a73b9982b54d8daf9
2020-01-23 14:56:11 -08:00
7e14c420ae [JIT] Test __getstate__ and __setstate__ for custom bound C++ classes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32470

Test Plan: Imported from OSS

Differential Revision: D19508250

Pulled By: jamesr66a

fbshipit-source-id: 481299fb3c18fa874c2a1d2993984bb6b3193bac
2020-01-23 14:56:06 -08:00
dbd29e5668 [JIT] Passing custom class as arg (#32260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32260

This makes it so you can actually pass the custom class as an arg to ScriptFunctions

Test Plan: Imported from OSS

Differential Revision: D19424252

Pulled By: jamesr66a

fbshipit-source-id: c3530186619655781dedbea03c2ad321aaff1cb8
2020-01-23 14:54:59 -08:00
ad4fba0ce4 Only run test_conv_large and test_conv_transposed_large_cuda on 32GB device (#32473)
Summary:
For some reason, these two tests started to fail on 16GB Volta on Linux...

Also fixes https://github.com/pytorch/pytorch/issues/31650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32473

Differential Revision: D19538314

Pulled By: ngimel

fbshipit-source-id: 266195f19d8cf76b035795e0e318c152ae72adc2
2020-01-23 14:50:24 -08:00
49cd83d735 no more build_pytorch_libs.sh/.bat (#32319)
Summary:
https://github.com/pytorch/pytorch/issues/12918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32319

Differential Revision: D19544272

Pulled By: soumith

fbshipit-source-id: dd32fa61efa78af908f21c7e54cb6484bf895e54
2020-01-23 14:45:54 -08:00
d234626267 [quant][graphmode] Support quantizing shared ClassType with different qconfigs (#32205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32205

to be filled

Test Plan:
python test_jit.py

Imported from OSS

Differential Revision: D19508031

fbshipit-source-id: cbf03d34e52eae62595c34fde6ec645cb6744ad9
2020-01-23 14:32:55 -08:00
ef94496b36 [JIT] throw if no self arg on ignored methods (#32503)
Summary:
There was a user who did this and it would seg fault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32503

Differential Revision: D19538481

Pulled By: eellison

fbshipit-source-id: dc3752028b9eff6ac88c025e8a2b5f8fd44ce32f
2020-01-23 14:27:00 -08:00
db02a4e4ce Support 3D attention mask in MultiheadAttention. (#31996)
Summary:
Support a 3D attention mask for MultiheadAttention. If `attn_mask` has the batch dimension, it will not be unsqueezed. Fix https://github.com/pytorch/pytorch/issues/30678
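A minimal usage sketch of the two mask shapes (the dimensions here are chosen arbitrarily for illustration):

```
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
L, N = 5, 2  # target sequence length, batch size
mha = nn.MultiheadAttention(embed_dim, num_heads)
q = k = v = torch.randn(L, N, embed_dim)

mask_2d = torch.zeros(L, L)                 # (L, S): broadcast over the batch, unsqueezed internally
mask_3d = torch.zeros(N * num_heads, L, L)  # (N*num_heads, L, S): used as-is with this change

out, _ = mha(q, k, v, attn_mask=mask_2d)
out, _ = mha(q, k, v, attn_mask=mask_3d)
```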
Relevant issues/pr:
https://github.com/pytorch/pytorch/pull/25359
https://github.com/pytorch/pytorch/issues/29520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31996

Differential Revision: D19332816

Pulled By: zhangguanheng66

fbshipit-source-id: 3448af4b219607af60e02655affe59997ad212d9
2020-01-23 13:16:48 -08:00
b6b8620871 Add unit test on export_opnames with interface. (#31531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31531

As suggested by suo, add a unit test on torch.jit.export_opnames with an interface. A submodule is annotated as an interface and assigned to an instance, and then re-assigned to another instance. Make sure the operator names are also updated.

Test Plan: Imported from OSS

Differential Revision: D19539129

Pulled By: iseeyuan

fbshipit-source-id: 71a76ae7790cdd577618ca278afdb132727f08dc
2020-01-23 12:27:22 -08:00
9af5a97b1d Fix nll_loss to support empty tensors on GPU (#31491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31491

Fixes #31472

Test Plan: Imported from OSS

Differential Revision: D19537231

Pulled By: pbelevich

fbshipit-source-id: 20a43251a0f68a7a3557dd8234daee2d4814e5dd
2020-01-23 11:45:59 -08:00
583bb97618 [quant][graphmode] Default to non-inplace in graph mode quantization API (#32204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32204

att

Test Plan:
.

Imported from OSS

Differential Revision: D19508030

fbshipit-source-id: 94814c3c126a196f3938f944abfa5ae2a24d8dde
2020-01-23 10:39:46 -08:00
ea7bebb7fe [PyTorch BC] Clean up the whitelist for PyTorch Op BC check (#32523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32523

remove stale items

Test Plan: cont build

Reviewed By: hl475

Differential Revision: D19526918

fbshipit-source-id: ee7392ae84e5ddf88284020775119e59c9b6533e
2020-01-23 09:25:37 -08:00
02aa3ba331 Raise error for code that risk deadlock (#32295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32295

Fix for https://github.com/pytorch/pytorch/issues/32045

Calling into the engine with the GIL can deadlock because:
- worker thread initialization acquires the GIL
- Any Node / hook can be a python function that will acquire the GIL

The choice was made here to raise an error, as one of the advantages of using cpp extensions with Python is being able to release the GIL. So we prefer to educate users to do it rather than doing it under the hood.

Test Plan: Imported from OSS

Differential Revision: D19430979

Pulled By: albanD

fbshipit-source-id: e43f57631885f12e573da0fc569c03a943cec519
2020-01-23 08:53:59 -08:00
21d475e20d [gloo] Skip registry warning (#31126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31126

The Gloo device creator registry throws a warning that confuses users - https://fb.workplace.com/groups/1405155842844877/permalink/3217491788277931/
Create a C10_DEFINE_SHARED_REGISTRY_WITHOUT_WARNING API to skip such warnings

Test Plan:
{F224342749}

Tested both `C10_DEFINE_SHARED_REGISTRY` and `C10_DEFINE_SHARED_REGISTRY_WITHOUT_WARNING`.
Make sure nothing breaks

Reviewed By: d4l3k

Differential Revision: D18904783

fbshipit-source-id: 0e0065d530956249a18325d4ed3cb58dec255d4c
2020-01-22 22:46:27 -08:00
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00
e735395fc6 [caffe2] use 2-stage EmbeddingSpMDM interface (#32271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32271

Use the 2-stage EmbeddingSpMDM interface in D19425982 to reduce the overhead of code cache lookup and lock contention.
Fix an issue in sparse_lengths_sum_benchmarks that generated empty indices when the average length is small (e.g., 1).

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D19425987

fbshipit-source-id: d5c5f0d46e0072403901809c31d516fa0f4b9b31
2020-01-22 19:05:36 -08:00
685f090ac8 [Rowwise Pruning][c2 op] Add Quantile Op (#32448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32448

Use binary search to compute the value at the given quantile across the input tensors.
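A rough Python analogue of the idea, assuming a bisection over the value range (the function name and iteration count are hypothetical, not the actual C2 op):

```
import torch

def quantile_by_bisection(tensors, q, iters=50):
    # Bisect [min, max] until roughly a fraction q of all elements
    # lies at or below the midpoint.
    flat = torch.cat([t.flatten().float() for t in tensors])
    lo, hi = flat.min().item(), flat.max().item()
    n = flat.numel()
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if (flat <= mid).sum().item() / n < q:
            lo = mid
        else:
            hi = mid
    return hi

print(quantile_by_bisection([torch.arange(100.0)], 0.9))  # ~89.0
```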

Test Plan: Newly added unittests;

Reviewed By: jspark1105

Differential Revision: D19487604

fbshipit-source-id: 0dc6627b78d1310ac35b3f1d53b89cc89a697ece
2020-01-22 16:59:56 -08:00
4bdfc71421 Fix race condition for to() backward that spans devices (#31930)
Summary:
While putting finishing touches on the gradient scaling PR (https://github.com/pytorch/pytorch/pull/26512), I discovered my multi-GPU test (which uses `to()` to transfer tensors between devices) was intermittently failing with bad numerics.  I knew it was going to be [a weird case from the start](https://www.imdb.com/title/tt8946378/quotes/qt4868203) and spent a week descending into madness.  It turns out, for backward ops that create gradients on a different device from the device on whose stream the op is executed, the streaming backward synchronizations in [input_buffer.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L46-L83) do not properly tell later ops to wait on the population/creation of those gradients.  For example, a cross-device `to()` backward (CopyBackward Node) enqueues a cudaMemcpyAsync on the current stream of the source (incoming gradient's) device, then [syncs getCurrentCUDAStream on the destination device with the cudaMemcpyAsync](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Copy.cu#L76).  However, `input_buffer.cpp` in such cases ([case (3)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L77-L81)) was not properly telling `opt_consumer_stream` to wait on the current stream of the destination device (`var`'s device).

Circumstances needed to repro in current master (see [my test](https://github.com/pytorch/pytorch/compare/master...mcarilli:backward_to_race_fix#diff-e68a7bc6ba14f212e5e7eb3727394b40R1901)):
- 2 devices, with non-default streams used for forward-pass ops on both devices (which is the default behavior in test_cuda.py)
- A `to()` that transfers a tensor requiring grad from one device to another
- A backward pass that routes back through to()'s backward (aka CopyBackward).

Under these circumstances, backward ops following CopyBackward on CopyBackward's destination device (aka the original forward-pass source device) race with the device-to-device transfer, and execute using partially-transferred data.

The present PR fixes the race condition and ensures that later ops wait on the CopyBackward transfer.  This PR should also make streaming backward safe for other backward ops that span devices, as long as they play nice and populate any new gradients they create using the "current stream" of the device(s) on which they create those gradients.
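A minimal sketch of the pattern being fixed (assumes at least two CUDA devices; non-default forward streams are the test_cuda.py default, per the description above):

```
import torch

x = torch.randn(8, device="cuda:0", requires_grad=True)
y = x.to("cuda:1")       # backward routes through CopyBackward
loss = (y * 2.0).sum()
loss.backward()          # later backward ops now wait on the device-to-device copy
```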

There are a couple minor issues where I'm not sure of the best approach:
- Should we guard onto the var's device for the entire body of InputBuffer::add?
- I'm fairly sure we need to `recordStream` on `var` if the consumer stream is different from the stream on which (we expect) `var` was created, but calling `c10::cuda::CUDACachingAllocator::recordStream` in input_buffer.cpp might break CPU-only builds.  I couldn't find a different API call to record streams that seemed CPU-build-agnostic.  Could I wrap the call with a macro?

Thanks to mruberry for helpful suggestions and also the organization/naming of the stream pool and streaming backward code that allowed me to (just barely) wrap my head around the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31930

Differential Revision: D19517617

Pulled By: mruberry

fbshipit-source-id: 183d5460aefa5d27366b465b0473b80ec80fa044
2020-01-22 16:32:24 -08:00
193ac31441 [jit] Enable IValue to hold a PyObject (#32491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32491

This PR enables IValue to hold a pure PyObject by adding a new enum tag and
a new jit_type to denote PyObject existence in IValue and the JIT type
system. We don't plan to expose this to users.

This is the basic piece that enables IValue to be adopted more broadly, such
as making RRef always hold an IValue; it might also simplify some compiler
logic
ghstack-source-id: 97039980

Test Plan: Imported from OSS

Differential Revision: D19502234

fbshipit-source-id: 90be001706d707d376cfbea25980fd82980df84a
2020-01-22 15:48:32 -08:00
556c0b063d Updating submodules
Summary:
GitHub commits:

87b81e7cb2
3a9a0976f2
9294f3b2fa
c8addc5ad4
9a9f1a849a
27cb280170

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 73beec64bf9c17fa6c42dd09ea85350e8c9c66ea
2020-01-22 15:30:31 -08:00
14e0bec9f2 [caffe2] remove unnecessary np.set_printoptions and fix test errors (#32475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32475

As title

Test Plan: CI

Reviewed By: houseroad

Differential Revision: D19508778

fbshipit-source-id: fd9ad63607535980505d155f3e3c3b7c6b95daf7
2020-01-22 14:49:47 -08:00
faffd2141a Corrected logical boolean expression (#32249)
Summary:
Changed bitwise & to logical && in the boolean expression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32249

Differential Revision: D19501586

Pulled By: eellison

fbshipit-source-id: afe374cfc9661182703cc82810d9cb735fbb8180
2020-01-22 13:54:16 -08:00
43eb931c0f Remove mis-exposed abort API on ProcessGroup
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32292

Test Plan: Imported from OSS

Differential Revision: D19430252

Pulled By: mrshenli

fbshipit-source-id: 4ec594e1be54afe774bdcecc0f1c9bda2edf5e0d
2020-01-22 12:51:20 -08:00
b7c6277c53 Adding QConfigTypePtrMap (#32203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32203

The type is needed for allowing multiple qconfig configurations for shared
ClassType, see next PR for more details

Test Plan:
.

Imported from OSS

Differential Revision: D19508027

fbshipit-source-id: a3df29dab3038bfa88c55dda98a3e8a78e99e5a1
2020-01-22 12:40:12 -08:00
38d122eca9 implement tuple constants (#31841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31841

Add Tuple Constants to JIT. The constraint here is that all elements of a tuple must themselves be insertable as constants. Previously tuples were special-cased in constant propagation, but now that there are more passes that insert constants, such as freezing, we should just have tuples be representable as constants.
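A small sketch of a function whose return value can now be emitted as a tuple constant (a hypothetical example, not from the PR's tests):

```
import torch

@torch.jit.script
def f():
    # Every element is itself a constant, so the whole tuple
    # can be inserted as a single constant.
    return (1, 2.0, "three")

print(f.graph)
```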

Test Plan: Imported from OSS

Differential Revision: D19439514

Pulled By: eellison

fbshipit-source-id: 3810ba08ee349fa5598f4b53ea64525996637b1a
2020-01-22 12:13:31 -08:00
69492ad6ac remove tuple logic in constant propagation (#31840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31840

The next PR in this stack makes tuples insertable as constants, so we can remove special handling of tuples in constant propagation.

Test Plan: Imported from OSS

Differential Revision: D19439515

Pulled By: eellison

fbshipit-source-id: c58f153157f1d4eee4c1242decc4f36e41c1aa05
2020-01-22 12:13:26 -08:00
b01d824a78 improve mayContainAlias (#31839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31839

There are a number of improvements that can be made to `mayContainAlias`, which I would like to do in follow ups. For now, this is an easy one.

Test Plan: Imported from OSS

Differential Revision: D19439516

Pulled By: eellison

fbshipit-source-id: 0042fb7eaae6cfb4916bf95dc38280517a4bd987
2020-01-22 12:13:20 -08:00
adf0916606 Add str[] float[] constants resubmit
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31791

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D19439513

Pulled By: eellison

fbshipit-source-id: a04c7401687b051f0d4fb4794963931ebe004194
2020-01-22 12:11:58 -08:00
e184a8843c Fix comparisions for ConcreteModuleType (#32256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32256

Previously two unrelated modules loaded from torch.jit.load
would compare equal because we only considered their data_ attributes which
are initialized blank in torch.jit.load. This changes ConcreteModuleType
to distinguish when the data_ attribute is blank vs when it is empty.

This replaces the poisoned logic.
ghstack-source-id: 96755797

Test Plan: oss

Differential Revision: D19423055

fbshipit-source-id: 79d6a50a3731c6eeb8466ba2a93702b49264bba0
2020-01-22 11:59:38 -08:00
8e689378c7 Move some of the helper functions for public use (#32202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32202

Move some helper functions in ModuleUseDeduper for public use

Test Plan:
.

Imported from OSS

Differential Revision: D19508034

fbshipit-source-id: 2e8e05eff6f3bbcfe6936598371e4afa72f9b11f
2020-01-22 11:35:37 -08:00
510a122d27 add missing align_corners annotation (#32492)
Summary:
Adds the missing annotation to the grid_sample and affine_grid functionals
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32492

Differential Revision: D19516550

Pulled By: ezyang

fbshipit-source-id: 064c8c99bf6eae6744237c0b151b3ce4c82ada96
2020-01-22 11:29:07 -08:00
1c017f0c14 Migrate max and min (binary) from TH to ATen. (#30851)
Summary:
TH implementation will be removed after the unary max and min are
migrated.

Benchmark: (Debian 10, Release build, gcc 7.4, no turbo)

```python
import timeit
for device in ('cpu', 'cuda'):
    print(f'device: {device}')
    for op in ('max', 'min'):
        for dtype in ('torch.double', 'torch.float', 'torch.int16',
                      'torch.int32', 'torch.int64'):
            for n, t in [(10_000, 200000),
                         (100_000, 20000)]:
                print(f'torch.{op}(a, b), numel() == {n} for {t} times, dtype={dtype}')
                # Benchmark the binary op; inputs are created on the device under test.
                print(timeit.timeit(f'torch.{op}(a, b)' +
                                    (';torch.cuda.synchronize()' if device == 'cuda' else ''),
                                    setup=f'import torch; '
                                          f'a = torch.arange({n}, dtype={dtype}, device="{device}"); '
                                          f'b = torch.ones({n}, dtype={dtype}, device="{device}") * ({n} / 2)',
                                    number=t))
    print()
```

Before:

```
device: cpu
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.241763713000182
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.7138833169992722
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.2183356810000987
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.7031846980007685
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7704679510006827
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.289198366999699
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.7937613740014058
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2930124340000475
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8032857640009752
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.2908709189996443
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
1.8829010000008566
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.2994690759987861
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
1.8037853410005482
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.2929310759991495
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.8075240359994496
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2932477679987642
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.7868400779989315
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2885970789993735
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8389664830010588
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.29402057399966

device: cuda
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
4.787109836999662
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.842438002999188
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.429616614999759
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.835390076999829
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.940423873000327
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4108991760003846
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.9318018840003788
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4168134739993548
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9610764919998473
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4189234130008117
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.960172712999338
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.4162539499993727
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.8985912560001452
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.4113489299998037
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.9160250799995993
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4128787690005993
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.8806865219994506
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4086357010000938
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9362181240012433
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4151225870009512

```

After:

```
device: cpu
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.2685823729998447
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.72004808300062
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.212242640000113
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.7089235590001408
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7767087259999244
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2916517639996528
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.8265984959998605
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.3002885240002797
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8084679720004715
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.3012119999993956
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
1.8800218449996464
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.3060645710002063
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.4905043950002437
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.9126290209997023
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7972335520007618
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2918074379995232
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.8047651860006226
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2992197730000044
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8526509560006161
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.3030709570002728

device: cuda
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
4.700986622000528
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.8415469050005413
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.3051693249999516
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.8321999460004008
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.8086475109994353
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.405110773999695
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.913458047999484
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4236377289998927
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9386842409994642
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4230227469997772
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
3.0341797270002644
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.4289592409995748
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.6091147850002017
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
2.036691903999781
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.8256167649997224
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4078955400000268
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.8631781489993955
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4210130069996012
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
3.0112479260005784
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4297719679998409

```

Partly solves https://github.com/pytorch/pytorch/issues/24594 and #24595

Close https://github.com/pytorch/pytorch/issues/25016

Continuing https://github.com/pytorch/pytorch/issues/27185
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30851

Differential Revision: D19515694

Pulled By: ezyang

fbshipit-source-id: 1764897f912d6ae24b0c361f19a1aacf96e0826e
2020-01-22 09:03:18 -08:00
b77c25dec0 Fix dll load logic for Python 3.8 on Windows (#32215)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31181 and https://github.com/pytorch/pytorch/pull/31162#discussion_r362495611.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32215

Differential Revision: D19501869

Pulled By: ezyang

fbshipit-source-id: 363824e52d2592ad968ecf1df345aa4c0daff915
2020-01-22 08:33:34 -08:00
c342c354a9 Put sparse all reduce results to input tensors (#32226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32226

Right now, if users call torch.dist.all_reduce() on dense tensors, outputs are put in the input tensors. But if users call torch.dist.all_reduce() on sparse tensors, outputs are neither returned explicitly to users nor put in the input tensors.

To make the torch.dist.all_reduce() API behave the same on both dense tensors and sparse tensors, this diff makes torch.dist.all_reduce() on sparse tensors put the output in the input tensors as well, as the sketch below shows. This is achieved by simply calling input_sparse.copy_(output_sparse); see PR https://github.com/pytorch/pytorch/pull/9005, which implemented copy_ for sparse tensors.
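A minimal sketch of the resulting behavior (assumes an already-initialized process group on every rank):

```
import torch
import torch.distributed as dist

indices = torch.tensor([[0, 1]])
values = torch.tensor([1.0, 2.0])
t = torch.sparse_coo_tensor(indices, values, (4,))

dist.all_reduce(t)
# With this diff, `t` itself now holds the reduced sparse result,
# matching the in-place semantics of all_reduce on dense tensors.
```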

close #31413
ghstack-source-id: 96984228

Test Plan: unit test

Differential Revision: D19192952

fbshipit-source-id: 2dd31dc057f20cc42b44b9e55df864afa2918c33
2020-01-22 08:06:56 -08:00
e37a24b044 Always return a new tensor from nn.functional.pad (#32350)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31734
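A minimal sketch of the guarantee, assuming the no-op padding case from the linked issue:

```
import torch
import torch.nn.functional as F

x = torch.randn(3)
y = F.pad(x, (0, 0))  # a no-op pad previously returned the input itself
assert y.data_ptr() != x.data_ptr()  # now always a freshly allocated tensor
```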
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32350

Differential Revision: D19501845

Pulled By: ezyang

fbshipit-source-id: ea79496d23dc0016f3caa233c53d283b08f60371
2020-01-22 08:03:42 -08:00
8abaa322da fix torch.eq() doc entry (#32399)
Summary:
Fix the `torch.eq()` doc example to match the current output (boolean instead of uint8):
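For reference, the behavior the corrected example documents:

```
>>> torch.eq(torch.tensor([1, 2]), torch.tensor([1, 3]))
tensor([ True, False])
```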
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32399

Differential Revision: D19498104

Pulled By: ezyang

fbshipit-source-id: e7ec1263226766a5c549feed16d22f8f172aa1a3
2020-01-22 07:43:10 -08:00
248f6d0485 Implement backend fallback fallthrough (#32439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32439

This adds c10::fallthrough_kernel which is a special boxed function which
can be used to implement fallthrough behavior at a dispatch key.  A fallthrough
kernel will redispatch to the next valid dispatch key.  It is implemented
in such a way that it costs no more to fallthrough than it does to go
straight to the actual implementation of the kernel.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D19503886

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 6ee05bd815c4ef444e612d19f62312dbb76f2787
2020-01-22 07:32:08 -08:00
0d610b4821 Remove the support of build options like NO_*, WITH_* (#32447)
Summary:
We will now use USE_*, BUILD_* consistently. The backward compatibility
for NO_* and WITH_* is hereby removed in this commit, as promised in the
comment (next release is beyond Feb 20):

    # Before we run the setup_helpers, let's look for NO_* and WITH_* variables and hotpatch environment with the USE_*
    # equivalent The use of NO_* and WITH_* is deprecated and will be removed in Feb 20, 2020.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32447

Differential Revision: D19515536

Pulled By: ezyang

fbshipit-source-id: 2f2c51e6d4674af690b190a1f0397b8f596b6a15
2020-01-22 07:25:29 -08:00
44b270d892 insert_quant_dequant pass support shared class types (#31408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31408

We'll error out when a graph is quantized with different QSchemes.
This only occurs when we have two modules of the same type (e.g. two Conv2d modules initialized with
the same arguments) that are quantized with two configs that would produce different quantized graphs,
for example per-tensor affine and per-channel affine. This is a rare case, so it should be OK to skip for now.
Actual support will come later.

Test Plan:
test_jit.py, test_quantization.py

Imported from OSS

Differential Revision: D19162366

fbshipit-source-id: 798f06d0ddef0c8458237ce88b62159cc77eec8b
2020-01-21 22:18:49 -08:00
60b6c99aa7 Updating submodules
Summary:
GitHub commits:

d2ee8a1a3f
a1543b168d

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: a1394f1c4a48920d3ce1403c70351e2c56eaecf0
2020-01-21 19:18:29 -08:00
64de93d8e7 Move log_normal to Aten(CPU) (#31854)
Summary:
Fix https://github.com/pytorch/pytorch/issues/24723.
Benchmark script :
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    return time.time()

device = "cpu"

#warm up
for n in [10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(1000):
        input.log_normal_()

for n in [1, 10, 100, 1000]:
    fwd_t = 0
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(10000):
        t1 = _time()
        input.log_normal_()
        t2 = _time()
        fwd_t = fwd_t + (t2 -t1)
    fwd_avg = fwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.4f (ms)." % (n, fwd_avg))
```
Test Device: skx-8180.
Before:
```
input size(128, 1) forward time is 0.0114 (ms).
input size(128, 10) forward time is 0.1021 (ms).
input size(128, 100) forward time is 1.0081 (ms).
input size(128, 1000) forward time is 10.1831 (ms).
```
After:
```
input size(128, 1) forward time is 0.0108 (ms).
input size(128, 10) forward time is 0.0969 (ms).
input size(128, 100) forward time is 0.9804 (ms).
input size(128, 1000) forward time is 9.6131 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31854

Differential Revision: D19314586

Pulled By: pbelevich

fbshipit-source-id: 2ea1d9a2c505e36aca9e609b52ccb3e8caf2ba8f
2020-01-21 19:07:31 -08:00
4973695268 Updating submodules
Summary:
GitHub commits:

d45f7b4f09
e6e8b9e871
da618022d2
2df47f519a

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: c4af09e70a56d11e845150ba3d90a570a3758e51
2020-01-21 17:16:46 -08:00
7fdc6cb74e Fix test_data_parallel name errors and add to run_test.py (#32428)
Summary:
While working on https://github.com/pytorch/pytorch/issues/31768 and trying to add tests for `DataParallel`, I discovered that:
- `test_data_parallel.py` can't be run through `run_test.py`
- running it with `pytest` fails with many name errors

`test_data_parallel.py` seems to have been split from `test_nn.py` in https://github.com/pytorch/pytorch/issues/28297 but not in a state where it can actually be run. Presumably `DataParallel` hasn't been tested by CI in the time since.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32428

Differential Revision: D19499345

Pulled By: ezyang

fbshipit-source-id: f9b748a99a5c85fc6675c22506cf10bbfd9c8a4d
2020-01-21 15:11:03 -08:00
0b606a4a7c Enhace DispatchStub to be thread safe from a TSAN point of view. (#32148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32148

TSAN would complain about multiple threads reading and writing to
`cpu_dispatch_ptr` without any sort of synchronization. Although this is a
valid issue from a TSAN point of view, there wasn't a correctness issue, since
both threads would compute the same value.

In order to fix this, I've used std::atomic for cpu_dispatch_ptr with relaxed
ordering guarantees.
ghstack-source-id: 96989435

Test Plan: Verify the TSAN tests pass.

Differential Revision: D19386082

fbshipit-source-id: 1ff0893e02529eddd06b2855d9565edf1bbf1196
2020-01-21 14:59:57 -08:00
be6ffac1b6 Adagrad optimizer - updated step function, added param_groups, state to optimizers
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29335

Differential Revision: D19449382

Pulled By: anjali411

fbshipit-source-id: ee238801ed9cdf15a80f2ce31cc4aab8ba582aea
2020-01-21 14:41:12 -08:00
0ed04bfdf6 Updating submodules
Summary:
GitHub commits:

40b08129cf
8cd8d286e6
d305f13e21
2957bd45f1

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 3b76eb7c8b6b5cf617aca7bd143e1ee404c4f0ed
2020-01-21 14:11:17 -08:00
e1d97025ee QNNPACK: Add support for dynamic quantization.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31896

Test Plan: Added new tests to QNNPACK's test suite to cover the new use case.  All new tests are passing.

Reviewed By: supriyar

Differential Revision: D19443250

Pulled By: AshkanAliabadi

fbshipit-source-id: fa7b1cffed7266a3c198eb591d709f222141a152
2020-01-21 12:33:08 -08:00
bc6005281b Updating submodules
Summary:
GitHub commits:

47e0b9b97e
6d225aaf95
ab4da8f60a

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 27bcdf08b6f5e47a5c948e094aca26bf67a6fb66
2020-01-21 12:12:31 -08:00
9e853e7090 Revert "Temporary workaround for BC test due to schema parser changes" (#32441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32441

This reverts commit ceffdbd2179e7dafdc6407909a00f4267db040de.

Test Plan: Imported from OSS

Reviewed By: houseroad

Differential Revision: D19500043

Pulled By: jamesr66a

fbshipit-source-id: 3bd22c55e4a81ff8b89d27f6e7438e3bdfc18606
2020-01-21 12:07:46 -08:00
f86d6c6afd Enhance NCCL watchdog to acitvely abort communicators for timed out ops. (#32338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32338

Timed-out ops could linger around if the user doesn't actually call
`wait()` on that op. As a result, to fix this I've introduced the following
functionality in this PR:

1. Keep track of all outstanding work in ProcessGroupNCCL.
2. Enhance NCCL watchdog to sweep through all outstanding work and perform the
following operations:
  i.   If the work has timed out, abort all communicators for that work and
       remove them from the cache.
  ii.  If the communicators for the work receive an error, abort the
       communicators and remove them from the cache.
  iii. If the work has completed (successfully/unsuccessfully), remove it from
       the list of outstanding work.
ghstack-source-id: 96895704

Test Plan: waitforbuildbot

Differential Revision: D19401625

fbshipit-source-id: 8f6f277ba2750a1e1aa03cdbc76e8c11862e7ce5
2020-01-21 12:05:40 -08:00
ec4be4e58c Redundant condition (#32396)
Summary:
Optimize the expression: 'A || (!A && B)' <=> 'A || B'

A: relErr <= maxRelErr
!A: relErr > maxRelErr
B: absErr <= absErrForRelErrFailure
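A quick exhaustive check of the identity (illustration only):

```
for A in (False, True):
    for B in (False, True):
        assert (A or (not A and B)) == (A or B)
```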
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32396

Differential Revision: D19499370

Pulled By: ezyang

fbshipit-source-id: c19bdcb2d4e7ff7806a8cd181c6e7e9e276b9979
2020-01-21 11:30:49 -08:00
839fe714de Fix BC test after TorchBind changes (#32429)
Summary:
It was broken by https://github.com/pytorch/pytorch/issues/32320. Let's be on the safe side and just whitelist all testing ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32429

Differential Revision: D19501016

Pulled By: dzhulgakov

fbshipit-source-id: 9cc1d363edb4579905bee1976a2b57255ce41738
2020-01-21 11:30:44 -08:00
e4f43bf7a5 Set rpath for JNI library on Mac (#32247)
Summary:
Without this, dlopen won't look in the proper directory for dependencies
(like libtorch and fbjni).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32247

Test Plan:
Build libpytorch_jni.dylib on Mac, replaced the one from the libtorch
nightly, and was able to run the Java demo.

Differential Revision: D19501498

Pulled By: dreiss

fbshipit-source-id: 13ffdff9622aa610f905d039f951ee9a3fdc6b23
2020-01-21 11:30:39 -08:00
9482683065 Remove dead includes in caffe2/test
Reviewed By: ezyang

Differential Revision: D19273220

fbshipit-source-id: 3dfc3388914e60611c84472e3fc529f5b5e40534
2020-01-21 11:30:34 -08:00
c13df8b688 Fix cusparse version check (#32405)
Summary:
The current version check doesn't use proper lexicographic comparison, and so it will break for future versions of cuSPARSE with `CUSPARSE_VER_MAJOR > 10` and `CUSPARSE_VER_MINOR < 2`. Also, my cusparse headers for CUDA 9 don't seem to include version macros at all, so I added `!defined` checks to be explicit about that.
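An illustration of the lexicographic idea in Python (the names here are hypothetical): compare (major, minor) as a tuple rather than testing each component independently.

```
def cusparse_at_least(major, minor, required=(10, 2)):
    # Tuple comparison is lexicographic: major first, then minor.
    return (major, minor) >= required

assert cusparse_at_least(10, 2)
assert cusparse_at_least(11, 0)      # passes even though minor < 2
assert not cusparse_at_least(10, 1)
```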
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32405

Differential Revision: D19499412

Pulled By: ezyang

fbshipit-source-id: 1593bf1e5a4aae8b75bb3b350d016cc6c3b9c009
2020-01-21 11:30:30 -08:00
9ce25cce91 add an option to record time spent waiting for GIL (#30842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30842

We'd like to profile the time spent on GIL acquisition to debug
performance issues.

Test Plan: Unit tests pass.

Differential Revision: D18837590

fbshipit-source-id: 925968f71c5fb96b8cd93f1eab4647602d2617d1
2020-01-21 11:29:23 -08:00
1177191c8e Synchronize with ShipIt.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2020-01-21 13:39:28 -05:00
cc2d5b15ad F.normalize uses clamp_min_ inplace (#32360)
Summary:
We don't care about autograd when `out != None` anyway
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32360

Differential Revision: D19452402

Pulled By: colesbury

fbshipit-source-id: c54775289f8a700019ca61e951d59ff4894ac980
2020-01-21 10:38:06 -08:00
0c03304bdf .circleci: Only run macos libtorch on master (#32378)
Summary:
These jobs were taking forver to run so we decided it's only really
worth it to run it on master.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32378

Differential Revision: D19499301

Pulled By: seemethere

fbshipit-source-id: 22cac5b5baee84e44607a16daeb77048cb0f5974
2020-01-21 10:38:01 -08:00
a2641e6005 Make type of Tensor.type() more specific (#32353)
Summary:
Fixes the following issue:

```
$ cat test.py
import torch

t = torch.tensor(1.5)
t.type(torch.float32)[None]

$ mypy test.py
test.py:4: error: Invalid index type "None" for "Union[str, Tensor]"; expected type "Union[int, slice]"
Found 1 error in 1 file (checked 1 source file)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32353

Differential Revision: D19499388

Pulled By: ezyang

fbshipit-source-id: 715111e934aea020b20f850d27e32c4f70b82572
2020-01-21 10:37:56 -08:00
418ebc827b Build: Respect USE_CUDNN=0, even if cudnn is found (#32404)
Summary:
Currently, setting `USE_CUDNN=0` has no effect and any cudnn library found on your system will be used anyway. This is especially problematic when your system has multiple CUDA versions installed, and you are building with a version that lacks a matching cudnn. CMake will find any other cudnn versions and you end up with both CUDA versions added to your compiler include paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32404

Differential Revision: D19499425

Pulled By: ezyang

fbshipit-source-id: a9b3f6f9dc22033481c3c1c5999b1a7ef98468cb
2020-01-21 10:36:03 -08:00
ecbf6f99e6 Removed unused weight update in prepack. Moved zero point update to (#32254)
Summary:
qlinear/qconv to be consistent with data update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32254

Differential Revision: D19422929

Pulled By: kimishpatel

fbshipit-source-id: 595a4f7d6fde4978c94f3e720ec8645f3f2bdb7a
2020-01-19 19:08:37 -08:00
b543e3cd6f support empty batch in group normalization (#32401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32401

https://github.com/pytorch/pytorch/issues/12013

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- 'test_GroupNorm_empty'

Differential Revision: D19463720

fbshipit-source-id: 8ae44590fc5eeb1adc69a2345d7cc2187d3307ac
2020-01-19 19:04:54 -08:00
7fbfb7eef2 Updating submodules
Summary:
GitHub commits:

ea6039a6c9
0d30b8e0fc
7acedd4723
4db6e3b785
cd898afb5e
cf5dd11204
08bdcfd87e
fc84c09b8f
454d37976b
a22e6b8cb4

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: b87550b26e69216be2a8e40870a6e7dab825261c
2020-01-19 03:30:58 -08:00
58234c0254 support torch script call over rpc (#32197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32197

This is to reland https://github.com/pytorch/pytorch/pull/30063. The main change is to match a general exception and grep for the "pickle" error word in the "test_script_functions_not_supported" unit test, as Python 3.5 and Python 3.6 throw different types of errors with different error messages for the rpc call in that unit test.
[test all] This diff makes the following changes:
1. Provides a new set of private Python rpc APIs that can accept an annotated TorchScript call; such a call can be serialized, deserialized and executed in C++ without the GIL. These private APIs will be bound to JIT in the future, and they differ from the public APIs in that the future JIT-bound private APIs will accept a qualified_name, not a callable. These private APIs are subject to deprecation once JIT supports a torch script function being a JIT type.

Also, these APIs require the torch script function to be defined and annotated by users in Python land; it can not be a script class/module constructor or a class/module method.

2. This diff also allows the public rpc APIs to accept an annotated TorchScript call and execute the code path that the private APIs above run on. Therefore, if users invoke an annotated TorchScript call over RPC, this call can be serialized, deserialized and executed in C++ without the GIL as well (see the sketch after this list).

3. The above private APIs call a newly defined C++ function that makes the rpc torch script call serializable, deserializable and executable in C++ land. This C++ function returns an ivalue::Future, so that in a follow-up diff it can be called when these private APIs are bound to JIT.

4. script_call.cpp/.h and the request_callback_impl.cpp file are refactored accordingly so that torch script calls and builtin calls can share the same message type and code.

5. Refactored deserializeResponse() and added a new utility to deserialize the response to an IValue
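A hedged sketch of point 2 as seen from user code (the worker name and setup are assumptions, not from this diff):

```
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def script_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

# On a worker where init_rpc has already been called, the annotated
# TorchScript function can be invoked over RPC like any other callable:
# fut = rpc.rpc_async("worker1", script_add, args=(torch.ones(2), torch.ones(2)))
# print(fut.wait())
```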

ghstack-source-id: 96879167
ghstack-source-id: 96879167

Test Plan: unit test

Differential Revision: D19402374

fbshipit-source-id: 04efcc7c167d08a6503f29efe55e76f2be4b2c5e
2020-01-18 09:24:17 -08:00
1ecad2bb2b Test passing custom class instance to bound method
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32320

Test Plan: Imported from OSS

Differential Revision: D19437335

Pulled By: jamesr66a

fbshipit-source-id: 8f5166dbe6fc5704b12b6224932460b12be0d39b
2020-01-17 23:09:38 -08:00
c7078a1ce8 Fix returning instance of custom class from method
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32312

Test Plan: Imported from OSS

Differential Revision: D19433511

Pulled By: jamesr66a

fbshipit-source-id: f048d5f60eaba992ee42fea2d318a59b3a156578
2020-01-17 23:09:34 -08:00
c7fdf5b251 Remove __torch__ from custom class qualname
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32301

Test Plan: Imported from OSS

Differential Revision: D19431645

Pulled By: jamesr66a

fbshipit-source-id: 198522a1641cb9f90fa4c614da4ca4162fadf456
2020-01-17 23:09:29 -08:00
ceffdbd217 Temporary workaround for BC test due to schema parser changes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32324

Test Plan: Imported from OSS

Differential Revision: D19438085

Pulled By: jamesr66a

fbshipit-source-id: 3dd2586e73c890a7bdadd6cbb3df2c186f93199d
2020-01-17 23:08:20 -08:00
61ee8c972f porting scatter_add to ATen (CPU) (#31662)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24758](https://github.com/pytorch/pytorch/issues/24758).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31662

Differential Revision: D19440824

Pulled By: ngimel

fbshipit-source-id: b13443cfcc8bcb9ec21f1cddb5c6fbc0ef4bb0f2
2020-01-17 21:36:54 -08:00
53429680d5 Remove stray @script (#32235)
Summary:
This should be covered under recursive script now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32235

Pulled By: driazati

Differential Revision: D19414889

fbshipit-source-id: 85f8132401dbe44c9dbaef7c0350110f90eb9843
2020-01-17 19:22:09 -08:00
8c40a78277 Back out "Calling JITed 8 Bit Fused SLS in FBGEMM from C2" (#32381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32381
Original commit changeset: 0dfa936eb503

"Facebook"
Temporary remedy for SEV :
https://our.intern.facebook.com/intern/sevmanager/view/s/193726

Test Plan: Run CI tests

Reviewed By: jspark1105

Differential Revision: D19458382

fbshipit-source-id: 731790f96b341ade5e70ff13e4b0b5fafad0fea6
2020-01-17 19:08:48 -08:00
25e62ebac9 Updating submodules
Summary:
GitHub commits:

9b13f58aa1
044b292acc
e1f67bbf3d

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 21df26f60f436eb8c1766f66afac4a0d93dd33d1
2020-01-17 18:32:53 -08:00
10c2bd35af Fix cudnn channels_last descriptors problem (#31952)
Summary:
This is to append fixes to https://github.com/pytorch/pytorch/issues/31783 so we can pull the fixes in without breaking tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31952

Differential Revision: D19433839

Pulled By: ngimel

fbshipit-source-id: 5b3d2f0b2a86aacd1d100dd86996ee0d63e5ee92
2020-01-17 17:45:07 -08:00
824e649d40 Specify requires_grad for Parameter replica so it's not always set to True by default (#32356)
Summary:
This is the proposed fix for issue https://github.com/pytorch/pytorch/issues/32018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32356

Differential Revision: D19450648

Pulled By: mrshenli

fbshipit-source-id: c63eeb6e9f5a87ebe613dd7013907559f295a7ea
2020-01-17 17:41:10 -08:00
0ac31a99be run code analysis against mobile interpreter (#32276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32276

Include the mobile interpreter in the mobile code analysis pass; it has some
manually registered ops in temporary namespaces.

The mobile interpreter is still under development, and these ops will be
removed in the future. This is a temporary step for an internal build
experiment.

Test Plan: Imported from OSS

Differential Revision: D19426818

Pulled By: ljk53

fbshipit-source-id: 507453dc801e5f93208f1baea12400beccda9ca5
2020-01-17 17:21:28 -08:00
5bc44fb6ea TensorIterator unrolling and vectorized load - step 0, 1 (#31974)
Summary:
These are steps 0 and 1 for https://github.com/pytorch/pytorch/issues/31975:

- Old code is moved to namespace `legacy`
- New `elementwise_kernel` and `launch_kernel` added to namespace `modern`, they only support 1d contiguous case for now
- In `gpu_kernel_impl`, dispatch to the new code if the problem is trivial 1d contiguous.

In terms of performance, this PR affects elementwise operators on contiguous tensors. The performance is improved slightly (up to 8%) for medium-sized tensors on Volta.

## compiled code
See https://github.com/zasdfgbnm/things/blob/master/2020Q1/disassembly-elementwise.ipynb

We can see that, previously, the add kernel compiles to
```
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 71
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R0, SR_TID.X ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 73
        /*0030*/                   S2R R3, SR_CTAID.X ;
        /*0040*/                   IMAD R0, R3, 0x200, R0 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 76
        /*0050*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x160], PT ;
        /*0060*/               P0 EXIT ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 110
        /*0070*/                   IMAD R3, R0.reuse, c[0x0][0x194], RZ ;
        /*0080*/                   IMAD R6, R0, c[0x0][0x198], RZ ;
        /*0090*/                   IADD3 R4, P0, R3.reuse, c[0x0][0x178], RZ ;
        /*00a0*/                   IADD3 R2, P1, R6.reuse, c[0x0][0x180], RZ ;
        /*00b0*/                   LEA.HI.X.SX32 R5, R3, c[0x0][0x17c], 0x1, P0 ;
        /*00c0*/                   LEA.HI.X.SX32 R3, R6, c[0x0][0x184], 0x1, P1 ;
        /*00d0*/                   LDG.E.SYS R5, [R4] ;
        /*00e0*/                   LDG.E.SYS R2, [R2] ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 77
        /*00f0*/                   IMAD R0, R0, c[0x0][0x190], RZ ;
        /*0100*/                   IADD3 R6, P0, R0, c[0x0][0x170], RZ ;
        /*0110*/                   LEA.HI.X.SX32 R7, R0, c[0x0][0x174], 0x1, P0 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 110
        /*0120*/                   FFMA R9, R2, c[0x0][0x1a0], R5 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 170
        /*0130*/                   STG.E.SYS [R6], R9 ;
	//## File "/home/xgao/pytorch-master/aten/src/ATen/native/cuda/Loops.cuh", line 81
        /*0140*/                   EXIT ;
.L_16826:
        /*0150*/                   BRA `(.L_16826);
        /*0160*/                   NOP;
        /*0170*/                   NOP;
.L_29063:
```
Now it compiles to
```
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 210
        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
        /*0020*/                   S2R R6, SR_CTAID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 217
        /*0030*/                   MOV R7, 0x4 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 208
        /*0040*/                   S2R R3, SR_TID.X ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 210
        /*0050*/                   LEA R6, R6, R3, 0x8 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 225
        /*0060*/                   IADD3 R2, R6.reuse, 0x40, RZ ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 217
        /*0070*/                   IMAD.WIDE R4, R6.reuse, R7.reuse, c[0x0][0x190] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 225
        /*0080*/                   IADD3 R3, R6, 0x80, RZ ;
        /*0090*/                   ISETP.GE.AND P1, PT, R2, c[0x0][0x160], PT ;
        /*00a0*/                   ISETP.GE.AND P0, PT, R6.reuse, c[0x0][0x160], PT ;
        /*00b0*/                   ISETP.GE.AND P2, PT, R3, c[0x0][0x160], PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 217
        /*00c0*/                   IMAD.WIDE R2, R6.reuse, R7, c[0x0][0x188] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 225
        /*00d0*/                   IADD3 R14, R6, 0xc0, RZ ;
        /*00e0*/                   ISETP.GE.AND P3, PT, R14, c[0x0][0x160], PT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 228
        /*00f0*/              @!P1 LDG.E.SYS R11, [R4+0x100] ;
        /*0100*/              @!P0 LDG.E.SYS R0, [R2] ;
        /*0110*/              @!P0 LDG.E.SYS R9, [R4] ;
        /*0120*/              @!P1 LDG.E.SYS R8, [R2+0x100] ;
        /*0130*/              @!P2 LDG.E.SYS R10, [R2+0x200] ;
        /*0140*/              @!P2 LDG.E.SYS R13, [R4+0x200] ;
        /*0150*/              @!P3 LDG.E.SYS R12, [R2+0x300] ;
        /*0160*/              @!P3 LDG.E.SYS R15, [R4+0x300] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 245
        /*0170*/                   IMAD.WIDE R6, R6, R7, c[0x0][0x180] ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 191
        /*0180*/                   FFMA R9, R9, c[0x0][0x168], R0 ;
        /*0190*/                   FFMA R11, R11, c[0x0][0x168], R8 ;
        /*01a0*/                   FFMA R13, R13, c[0x0][0x168], R10 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 245
        /*01b0*/              @!P0 STG.E.SYS [R6], R9 ;
        /*01c0*/              @!P1 STG.E.SYS [R6+0x100], R11 ;
        /*01d0*/              @!P2 STG.E.SYS [R6+0x200], R13 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 191
        /*01e0*/                   FFMA R15, R15, c[0x0][0x168], R12 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 244
        /*01f0*/               P3 EXIT ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 245
        /*0200*/                   STG.E.SYS [R6+0x300], R15 ;
	//## File "/home/xgao/pytorch/aten/src/ATen/native/cuda/Loops.cuh", line 248
        /*0210*/                   EXIT ;
.L_727:
        /*0220*/                   BRA `(.L_727);
        /*0230*/                   NOP;
        /*0240*/                   NOP;
        /*0250*/                   NOP;
        /*0260*/                   NOP;
        /*0270*/                   NOP;
.L_32233:
```

## Benchmark

The benchmark is for the add kernel on Volta.

See https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-unroll.ipynb

For tensors of size from 2^20 to 2^30, previously we had
```
1.5.0a0+dedd16b
dedd16b4181cae81e37e978cd3bf24c1ba35ca05
33 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.7 µs ± 75 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
78.9 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
140 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
261 µs ± 71.4 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
506 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
993 µs ± 189 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.96 ms ± 139 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.9 ms ± 955 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.79 ms ± 187 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Now we have
```
1.5.0a0+b1a239b
b1a239be8d529e89875fe47cd09964ef3a9516ac
30.4 µs ± 18 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.2 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
75 µs ± 476 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
134 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
253 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
489 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
961 µs ± 431 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.91 ms ± 578 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.8 ms ± 88.8 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.57 ms ± 763 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
It is slightly better.
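For reference, a minimal sketch of this kind of measurement (a hypothetical stand-in for the linked notebook, assuming a CUDA device; the notebook's actual harness uses `%timeit`):
```
import torch

def bench_add(n, iters=1000):
    a = torch.randn(n, device='cuda')
    b = torch.randn(n, device='cuda')
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        torch.add(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

for exp in range(20, 31):  # tensor sizes 2^20 .. 2^30, as above
    print(f"2^{exp}: {bench_add(2 ** exp):.3f} ms")
```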
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31974

Differential Revision: D19450765

Pulled By: ngimel

fbshipit-source-id: 79601bfceb5da84ff87384ba8193793eb4095a2e
2020-01-17 17:16:23 -08:00
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
c8ca70e39d Updating submodules
Summary:
GitHub commits:

54b290f00f
e8df50310d
ef5c9efe12

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 7b6dc88d40e8fd8c396d4d12846db43b0fb4258c
2020-01-17 15:48:29 -08:00
7e3c438913 Renaming IValue List functions (#32093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32093

toGenericListRef -> toListRef
isGenericList -> isList
toGenericList -> toList
toXListRef -> toXVector

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D19369767

Pulled By: zdevito

fbshipit-source-id: 4f0078f95b83e6586524c03f7bcf206722fdd9ae
2020-01-17 15:17:45 -08:00
bdd5e15437 skip testExceptions in ProcessGroupGloo if built with TSAN (#32242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32242

TSAN and fork don't play well together, so skip this test if we're
building under TSAN. It will still run in other modes.

Differential Revision: D19416113

fbshipit-source-id: 7e88d63a843356372160c2524c05e8fd1706553e
2020-01-17 14:17:06 -08:00
5a58c16722 Updating submodules
Summary:
GitHub commits:

29aba0a287
37a97eb4de
0efdd57292
6d886fc7eb
2e5854752a
931d1c643b
781986ef71
2e6d2903d7
e04348ff63
e8650fd560

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: abd7ee4aaec8401b2c885335940773a0655b4496
2020-01-17 12:48:36 -08:00
9b6ec61bfd exposing CPU/GPU Copy ops (#32248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32248

expose CPU/GPU copy ops

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:torch_integration_test

Reviewed By: houseroad

Differential Revision: D19405856

fbshipit-source-id: 1df4aa202e26647cb81e9fe7e4478e594a5f7f3e
2020-01-17 12:40:43 -08:00
e7bc1663bd fix unchecked cast alias analysis (#32309)
Summary:
Unchecked cast just refines the type of a value; the value stays the same, so the output should alias the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32309

Differential Revision: D19439037

Pulled By: eellison

fbshipit-source-id: fe6902d0d9a5a9ef5e9c13e1dbd056576d8c327e
2020-01-17 12:29:28 -08:00
df514fd8c0 C++ C2/Glow operator unittest
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32258

Test Plan:
```
 buck test glow/fb/test/numerics:fp16_op_test
```

Reviewed By: bddppq

Differential Revision: D19401786

fbshipit-source-id: 1382b5208be6172d3e6f768dedad7ebec31cffc9
2020-01-17 12:13:34 -08:00
e133d8be3b Fix ASAN / potential segfault in quantized Tensor memory allocations.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29882

Differential Revision: D18522039

Pulled By: AshkanAliabadi

fbshipit-source-id: 1fdc68491aa2ac176633b9ecc3ee78c9175a97aa
2020-01-17 12:09:25 -08:00
4e69352713 Add 64bit atomic fetch add (#32354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32354

Adding an int64 version of AtomicFetchAdd.

Reviewed By: bwasti

Differential Revision: D19434349

fbshipit-source-id: b2358e8c5c6b7cd7e7b21de974b4ee1b5258fcf4
2020-01-17 11:43:43 -08:00
aa61d1ee85 Add a new job to support custom build (#32323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32323

### Summary

Since we have released the custom build in 1.4.0, it's time to set up a CI for that. This PR adds a new iOS job to the iOS builds. To save time, it only runs the arm64 build.

### Test Plan

- Don't break any iOS jobs
- Custom Build works.

Test Plan: Imported from OSS

Differential Revision: D19451342

Pulled By: xta0

fbshipit-source-id: 9de305c004fc795710ecf01d436ef4792c07760c
2020-01-17 11:39:08 -08:00
7732924501 Delete unused bernoulli_Tensor from THTensorRandom.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32328

Test Plan: Imported from OSS

Differential Revision: D19448736

Pulled By: pbelevich

fbshipit-source-id: 92380ca1e0c0ac88d100e6fba8d216a46d0b181e
2020-01-17 11:09:19 -08:00
8c1268aad3 Use default scale/zero_point in fake_quantize module instead of None (#32318)
Summary:
Distributed data parallel cannot broadcast None, so when we prepare the model for QAT and try to save it, the save errors out.
fixes: https://github.com/pytorch/pytorch/issues/32082
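A minimal sketch of the failure mode and the fix (an illustrative module, not the actual FakeQuantize code):
```
import torch
import torch.nn as nn

class FakeQuantSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Before: scale/zero_point defaulted to None, and DDP cannot
        # broadcast a None buffer when preparing a QAT model.
        # After: register concrete default tensors instead.
        self.register_buffer('scale', torch.tensor([1.0]))
        self.register_buffer('zero_point', torch.tensor([0], dtype=torch.long))

    def forward(self, x):
        return torch.fake_quantize_per_tensor_affine(
            x, float(self.scale), int(self.zero_point), 0, 255)
```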
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32318

Differential Revision: D19434801

Pulled By: jerryzh168

fbshipit-source-id: ee70abe4c3dcdd3506fb7dd0316aee2fb1705469
2020-01-17 11:04:08 -08:00
5b815d980e Added cummin
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32238

Differential Revision: D19416791

Pulled By: anjali411

fbshipit-source-id: 5aadc0a7a55af40d76f444ab7d7d47ec822f55a5
2020-01-17 10:51:58 -08:00
78d8f691ad Don't dispatch to integral types in smooth_l1_kernel
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32333

Differential Revision: D19442787

Pulled By: ngimel

fbshipit-source-id: 9578483202614d7406eceb13cbf15b253c04f237
2020-01-17 10:47:43 -08:00
6a5a55d573 use gtest asserts in ProcessGroupGlooTest instead of other checks (#32138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32138

I personally prefer `throw std::runtime_error("BOOM")`, but we should probably use asserts here now that this uses gtest. This also ensures that the correct
exceptions are thrown by the `testSignal` tests.
ghstack-source-id: 96811000

Differential Revision: D19382905

fbshipit-source-id: 1b00dd70524d03c8bd6f48715baa5070a7985467
2020-01-17 10:31:59 -08:00
4968bc2450 cap the maximum depth of bailout chains at 1 (#32073)
Summary:
This is another implementation of the maximum bailout depth.
The first version was implemented in https://github.com/pytorch/pytorch/pull/31521
This one has the advantages that:
* the bailout depth only exists in `CodeImpl`, which seems to be an appropriate place to keep it.
* threading state through many objects is reduced to threading it through `CodeImpl` and `getPlanFor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32073

Differential Revision: D19443432

Pulled By: Krovatkin

fbshipit-source-id: 898384bb2308a1532a50a33d9e05cfca504711e6
2020-01-17 09:42:46 -08:00
61a2b34113 Updating submodules
Summary:
GitHub commits:

2d9c2bb401

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: ea12c419c4bab8ce60793deecb10a8ead086a4d5
2020-01-17 05:54:26 -08:00
904ab092c2 fix testSend and testRecv in ProcessGroupGlooTest (#32134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32134

These tests weren't written in the most correct way and were often
flaky. It was tricky to identify these tests as flaky until we moved this file
to use gtest.

The gist of the issue is that the test previously would not coordinate sends
and recvs properly. For example, we created a single thread to test an
abortRecv and a successful recv. A separate sender thread was used to send 2
messages. What could go wrong here is that the first send could successfully
complete, resulting in the receiving end processing the message before it gets
the abort signal. In this case we would have an error in the test.
ghstack-source-id: 96806879

Differential Revision: D19379395

fbshipit-source-id: 24782ccaf6e6ec6b445378b29d5f10f901e0dee6
2020-01-17 04:00:39 -08:00
7a9c920bac add lock for ncclCommAbort (#31901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31901

ncclCommAbort is not thread-safe, so we add a lock for it.
ghstack-source-id: 96829715

Test Plan: unit tests

Differential Revision: D19293869

fbshipit-source-id: 711b4a07605d6e5a81577247d2f90a78041c1809
2020-01-17 03:57:08 -08:00
91bdb872ce fix spelling mistake: excpected -> expected
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28817

Differential Revision: D18544562

Pulled By: dgisser

fbshipit-source-id: 51f728e807f9c4bb30f58585d5b6f436cb880153
2020-01-17 00:11:08 -08:00
ef5ae4823a Register RoIAlignRotated with C10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30785

Reviewed By: wat3rBro

Differential Revision: D18415056

fbshipit-source-id: e00376bec948309d53f2172697cd477449f769b2
2020-01-16 16:32:28 -08:00
b79030d6c8 remove unused code after refactoring optimizations into profiling-sensitive and profiling-insensitive (#32106)
Summary:
After we removed `Specialize_AutogradZero` from the optimization pipeline of the simple executor mode, we don't need to mark any inputs as undefined in `autodiff`. Also, `needsGradient` in `graph_executor.cpp` never runs on graphs with profiling information, so I removed that code as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32106

Differential Revision: D19374238

Pulled By: Krovatkin

fbshipit-source-id: 4223d3efe3c904a55a28471e5ae9593017ce3e07
2020-01-16 16:31:16 -08:00
c2761490fc Enhancing the test (#32321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32321

Updating the test to check more meaningful semantics

Test Plan:
[xintchen@devvm6308.prn2 ~/fbsource/fbcode] buck test mode/dev //caffe2:ATen-core-test -- 'OperatorRegistrationTest\.whenRegisteringCPUTensorType_thenCanOnlyCallUnboxedWithCPUTensorIdDispatchKey'
Building: finished in 0.4 sec (100%) 517/517 jobs, 0 updated
  Total time: 0.5 sec
Trace available for this run at /tmp/testpilot.20200116-132729.2541763.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision e5f315ebe0508d11fc281fa4b4f7b43d2ef1c003 fbpkg 67e8eb96914f400db234fd9af70fdcde at Wed Jan 15 23:38:32 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/762/t.par
Discovering tests
Running 1 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/6192449492430045
      ✓ caffe2:ATen-core-test - OperatorRegistrationTest.whenRegisteringCPUTensorType_thenCanOnlyCallUnboxedWithCPUTensorIdDispatchKey 0.002 1/1 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6192449492430045
Summary (total time 1.15s):
  PASS: 1
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0

Differential Revision: D19436345

fbshipit-source-id: c1f2383d62627aa4507616b8905ceb42ac563e9d
2020-01-16 15:56:34 -08:00
53708e21ed classic fixed-point liveness
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31724

Differential Revision: D19426570

Pulled By: Krovatkin

fbshipit-source-id: 3387dfb25e6e9456d5d0517eac1d2e44e61d6813
2020-01-16 15:13:22 -08:00
8c8bd79f32 Add CI scripts for Custom Build (#32316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32316

### Summary

Since the Custom Build has been released in 1.4.0, it's time to set up CI. To do that, we need to:

1. Add a Python script to generate the YAML file.
2. Add new build scripts to Circle CI (arm64 only).

### Test Plan

- Don't break the current iOS CIs

Test Plan: Imported from OSS

Differential Revision: D19437362

Pulled By: xta0

fbshipit-source-id: 395e27a582c43663af88d11b1ef974a4687e672c
2020-01-16 14:46:16 -08:00
34c751c263 Eliminate exception throwing code from dispatch call sites (#32168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32168

We move the exception raising into the function, saving us a
big pile of instructions for raising the stack.

After this stack of changes, the compiler is willing to inline, e.g.,
`c10::KernelFunction::callUnboxed<at::Tensor, at::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&) const::__func__`
(whereas previously it refused to do so.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19392948

Pulled By: ezyang

fbshipit-source-id: d5edab00cae48444b308e74438a17a421532c08f
2020-01-16 14:43:16 -08:00
b85dbe8f7b Out-of-line construction of OperatorName. (#32121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32121

This reduces code size in the call sites of this function (of which
there are many: one for every operator call) since we no longer have
to construct std::string at the site.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19392951

Pulled By: ezyang

fbshipit-source-id: 8bc43d46ba635380ff9f8989f7557fdd74b552cf
2020-01-16 14:43:12 -08:00
36d09197ab Move error reporting code out-of-line from header. (#32118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32118

This reduces code size and makes the calling function more likely to inline.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19392950

Pulled By: ezyang

fbshipit-source-id: 5e3829cca5604407229f93c2486eb9a325581ea2
2020-01-16 14:43:07 -08:00
7b7390778c Make an assert on a hotpath trigger only in DEBUG mode. (#32117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32117

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19392949

Pulled By: ezyang

fbshipit-source-id: 7f579e45d49bddeab36b8dd1a90c83224a368ac8
2020-01-16 14:42:18 -08:00
8746f90cf6 Fix weight backward for cudnn conv of large tensor (#31889)
Summary:
This is the last PR for https://github.com/pytorch/pytorch/issues/22496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31889

Differential Revision: D19431371

Pulled By: ngimel

fbshipit-source-id: 754fa91d49ad03549cb07aa30dde34bf9e851302
2020-01-16 14:15:52 -08:00
b26ee54176 For ppc64le, stop presenting the python 2.7 builds (we will no longer… (#32315)
Summary:
For ppc64le, we no longer plan to run regular builds on Python 2.7, and we wish to stop
publicizing the build status for those two builds (ppc64le/CPU and ppc64le/GPU each on py27).

This pull request simply removes the build status links for these two builds, replacing them
with a generic dash character (consistent with other un-publicized builds within the table).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32315

Differential Revision: D19435939

Pulled By: soumith

fbshipit-source-id: c9f31e7acba83e42f6a758ac011bbef36fd8aaa0
2020-01-16 13:49:40 -08:00
cd99b3706a Pin Pillow to latest and use a torchvision that works with it (#32290)
Summary:
Follow on from https://github.com/pytorch/pytorch/pull/31777, as suggested in https://github.com/pytorch/pytorch/pull/31777#issuecomment-575166543.

Pillow 7.0.0 removed `PILLOW_VERSION`; `__version__` should be used instead.

torchvision 0.5.0 switched from using `PILLOW_VERSION` to `__version__`.
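The attribute change in question, for reference:
```
import PIL

# Pillow < 7.0.0 exposed PIL.PILLOW_VERSION; it was removed in 7.0.0.
# PIL.__version__ is the attribute supported going forward.
print(PIL.__version__)
```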
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32290

Differential Revision: D19430280

Pulled By: mrshenli

fbshipit-source-id: be8d6317a4948d71e818adeafe61dfe567df5601
2020-01-16 10:48:22 -08:00
f94aab45fd Logical condition reduction (#32201)
Summary:
x || ( !x  &&  y )  <=>  x || y
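The identity can be checked exhaustively (a quick sanity check, not from the PR):
```
for x in (False, True):
    for y in (False, True):
        assert (x or (not x and y)) == (x or y)
print("identity holds for all inputs")
```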
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32201

Differential Revision: D19429334

Pulled By: ezyang

fbshipit-source-id: 044dc46c2d9a7e180aa1795703c0097b0c7c3585
2020-01-16 07:57:12 -08:00
14548c2d5b out variant for native_batch_norm forward (#29192)
Summary:
This deals with the forward of the native BatchNorm CUDA impl to support inplace operation. The larger issue: https://github.com/pytorch/pytorch/issues/26288

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29192

Differential Revision: D19410370

Pulled By: ezyang

fbshipit-source-id: a6889c96bdd848f3a1cb2d943d06e054d22fb7ab
2020-01-16 07:24:13 -08:00
bab87e4b60 reimplement __torch_function__ overrides for torch.functional using inline logic (#32194)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30831.

This improves the performance of operators in the `torch.functional` namespace that are overridable by `__torch_function__` implementations when supplied with `Tensor` operands.
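For context, a minimal `__torch_function__` override looks like the sketch below (written against the present-day form of the protocol; the exact signature at the time of this PR differed slightly):
```
import torch

class Wrapper:
    """Not a Tensor, but participates in torch function dispatch."""
    def __init__(self, data):
        self.data = data

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"dispatched: {func.__name__}")
        # Unwrap our wrapper and call the underlying implementation.
        args = [a.data if isinstance(a, Wrapper) else a for a in args]
        return func(*args, **kwargs)

t = Wrapper(torch.arange(6))
print(torch.split(t, 2))  # prints "dispatched: split", then the chunks
```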

Running the split benchmark in various configurations produces the following timings:

<details>
<summary>Expand for timings on <code>master</code> </summary>

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 3.340

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 3.333

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 3.366

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 3.385

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 3.468

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 3.416
```
</details>

<details>
<summary>Expand for timings with this pull request applied</summary>

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 2.261

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 2.223

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.237

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.218

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.259

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.234
```

</details>

<details>
<summary>Expand for timings on <code>master</code> with <code>__torch_function__</code> dispatch disabled </summary>

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 2.180

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 2.172

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.171

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.146

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.175

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.152
```

</details>

So at least on the machine I'm testing on, this brings the overhead down to less than 100 ns. For comparison, the overhead for `__array_function__` in NumPy is about 850 ns on the same machine.

<details>
<summary>Expand for timings for NumPy <code>__array_function__</code> dispatch </summary>

```
In [1]: import numpy as np

In [2]: %timeit np.mean([1])
8.89 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit np.mean._implementation([1])
8.04 µs ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

See [the implementation in NumPy](https://github.com/numpy/numpy/blob/master/numpy/core/overrides.py#L195) for why this measures `__array_function__` overhead.

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32194

Differential Revision: D19410396

Pulled By: ezyang

fbshipit-source-id: ada788a4399c81cd7eb2d548aa04a2459e96634a
2020-01-16 07:10:38 -08:00
7df5dc2775 Creating callUnboxedWithDispatchKey method (#32198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32198

Creating a method called `callUnboxedWithDispatchKey`.

Also adding tests to make sure it works.

Test Plan: buck test mode/dev //caffe2:ATen-core-test

Differential Revision: D19402815

fbshipit-source-id: b206cf04b1216fbbd5b54ac79aef495cb0c1be06
2020-01-16 01:37:41 -08:00
d75b6b3f9d Support shape inference and lowering of SparseLengthsWeightedSumFused4BitRowwise (#32257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32257

Pull Request resolved: https://github.com/pytorch/glow/pull/4018

att.

Test Plan:
Unit tests:
```
buck test glow:masterCaffe2ImporterTest -- caffe2.SparseLengthsSumFused4BitRowwise
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: jfix71

Differential Revision: D19389014

fbshipit-source-id: 5f6863443adee5d3bf7a50a105866441eefb9560
2020-01-15 23:49:06 -08:00
f3b62d4b1c Updating submodules
Summary:
GitHub commits:

191bbb1069
9d5a6e33e3
2bdfe1544a
1600bee8de
b7f1b3e51c
3220376f13
1ba747dfb4
0d5b08cbfc
481179a38e
9bc4f9c40f

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 79135519c3449c2b77ff1ca7d4f13724e2390f6e
2020-01-15 21:37:32 -08:00
851a7e861b Add CAFFE2_API to video decoding functions (#31187)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31132
Also closes old issue https://github.com/pytorch/pytorch/issues/11735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31187

Differential Revision: D19147172

Pulled By: pbelevich

fbshipit-source-id: e959058eec3489061f431fbecc99ded0d4dc1704
2020-01-15 19:39:02 -08:00
89c6e18c43 Updating submodules
Summary:
GitHub commits:

9915834ced
3cdb0d61d6
93a4e9f4cc
dafd450683
b5d5670e40
bab52dcc84
d2b4d42d4b
83479196c3
f2ec66095a
99561fee3b
eacaa4f35d
4ce4667b20
89291814cc

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 2a3c90f0a7615441dae746b18b9048cfddf0f4de
2020-01-15 17:54:21 -08:00
90c65b81c3 Define repr() on IValues (#32232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32232

Previously, we were using `operator<<` as the default way of printing
IValue constants during serialization. The semantics of `operator<<`
were ill-defined; and this bit us in particular with strings and lack of
quoting.

This PR defines the role of `operator<<`: much like Python `str()`, it
is intended to produce a human-readable-ish representation for
debugging purposes.

This PR also defines a new `repr()` function on IValue that is intended
to produce a valid Python expression that can be used to recreate an
object with the same value. `repr()` is not defined on all IValue kinds
(notably tensors!) for this reason.
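The distinction mirrors Python's own `str()`/`repr()` split; the string case that motivated this:
```
s = "hi"
print(str(s))   # hi    <- no quotes: ambiguous in serialized output
print(repr(s))  # 'hi'  <- a valid expression that round-trips
assert eval(repr(s)) == s
```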

Test Plan: Imported from OSS

Differential Revision: D19417036

Pulled By: suo

fbshipit-source-id: c102d509eaf95a28b6a62280bc99ca6f09603de5
2020-01-15 17:35:41 -08:00
104b2c610b Tensor prep from image in native (#31426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31426

Tensor conversion from a YUV image is moved to native code, with optimizations to eliminate branching inside the loop, avoid variable declarations inside the loop, and use fewer ops.

Perf stats from local devices, measuring conversion of a 320x240 camera image to a 1x3x224x224 tensor.
Legend:
Java - current Java impl
JavaOpt - current Java impl + the same optimizations (no if/else in the loop, variables declared outside the loop, inlining, etc.)
C - C impl

```
Nexus 5
JavaOpt N:25 avg:119.24 min: 87 max:177 p10:102 p25:105 p50:115 p75:127 p90:150
      C N:25 avg: 17.24 min: 14 max: 39 p10: 14 p25: 15 p50: 15 p75: 16 p90: 23
   Java N:25 avg:139.96 min: 70 max:214 p10: 89 p25:110 p50:139 p75:173 p90:181
avg C vs JavaOpt 6.91x

Pixel 3 XL
JavaOpt N:19 avg: 16.11 min: 12 max: 19 p10: 14 p25: 15 p50: 16 p75: 18 p90: 19
      C N:19 avg:  5.79 min:  3 max: 10 p10:  4 p25:  5 p50:  6 p75:  6 p90:  9
   Java N:19 avg: 16.21 min: 12 max: 20 p10: 14 p25: 15 p50: 16 p75: 18 p90: 20
avg C vs JavaOpt 2.78x

Full build with 4 abis inside:
Pixel 3 XL
JavaOpt N:25 avg: 18.84 min: 16 max: 24 p10: 16 p25: 17 p50: 18 p75: 20 p90: 22
      C N:25 avg:  7.96 min:  5 max: 10 p10:  7 p25:  7 p50:  8 p75:  9 p90:  9
avg C vs JavaOpt 2.36x
```

Test Plan: Imported from OSS

Differential Revision: D19165429

Pulled By: IvanKobzarev

fbshipit-source-id: 3b54e545f6fbecbc5bb43216aca81061e70bd369
2020-01-15 17:10:00 -08:00
de5821d291 Torchscript print to logcat (#31456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31456

External request https://discuss.pytorch.org/t/jit-android-debugging-the-model/63950

By default, TorchScript print output goes to stdout. On Android it is not visible in logcat by default; this change propagates it to logcat.

Test Plan: Imported from OSS

Differential Revision: D19171405

Pulled By: IvanKobzarev

fbshipit-source-id: f9c88fa11d90bb386df9ed722ec9345fc6b25a34
2020-01-15 16:44:56 -08:00
31b7d0873c Add File existence checking (#32208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32208

### Summary

The master branch generates `libtorch_cpu.a`, which differs from the release branch. This PR skips any missing libs before archiving.

### Test Plan

- don't break the nightly build

Test Plan: Imported from OSS

Differential Revision: D19420042

Pulled By: xta0

fbshipit-source-id: fb28df17b7e95d5c7fdf5f3a21bece235d7be17c
2020-01-15 15:35:50 -08:00
8b4c695e47 Added const folding for ONNX mul, div, sqrt ops (#32077)
Summary:
An example of a model with such leaf nodes is the faster_rcnn model. This PR helps optimize ONNX ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32077

Reviewed By: hl475

Differential Revision: D19399622

Pulled By: houseroad

fbshipit-source-id: 35c628c6f1514b79f1bcf7982c25f0f4486f8941
2020-01-15 15:31:34 -08:00
ffc8e255c4 Sort export w/ negative axes (#31971)
Summary:
Fixing export of Sort on negative axes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31971

Reviewed By: hl475

Differential Revision: D19325874

Pulled By: houseroad

fbshipit-source-id: 18ab2bf39221970c8ab65a1355f5759f88faa54f
2020-01-15 15:13:23 -08:00
4460a86cd6 Support op registration if name starts with underscore (_) (#32017)
Summary:
This is required for registering the torchvision::_new_empty_tensor op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32017

Reviewed By: hl475

Differential Revision: D19399606

Pulled By: houseroad

fbshipit-source-id: 43e1f2d78d2a0310af347b42f7e9b54cd503a20d
2020-01-15 14:57:57 -08:00
01010f5705 Add comments to torch::nn::ConvTranspose{1,2,3}d modules explaining how to use them in a Sequential module (#32223)
Summary:
Following changes in https://github.com/pytorch/pytorch/pull/31005.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32223

Differential Revision: D19415328

Pulled By: yf225

fbshipit-source-id: f6f74f10ba3b5cc7e1a92f8b02ea4c9747018ae8
2020-01-15 14:53:33 -08:00
a5161c7022 Update out-of-date comment on Docker image updates. (#32224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32224

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19416878

Pulled By: ezyang

fbshipit-source-id: 0205d0635658a3328128dcaad94bbbef505342be
2020-01-15 14:30:58 -08:00
322f34b245 Adding DDP Design Note
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32158

Test Plan: Imported from OSS

Differential Revision: D19405980

Pulled By: mrshenli

fbshipit-source-id: 808ef1c71b637546f8872375bf1828967b1a5a60
2020-01-15 14:10:45 -08:00
74621ca926 Add allgather_base as per our discussion re: ProcessGroup interface. (#31892)
Summary:
Introduce ProcessGroup::allgather_base. No implementation yet; the plan is to add it one PG backend at a time in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31892

Test Plan: No functional changes, no tests yet.

Differential Revision: D19290739

Pulled By: agolynski

fbshipit-source-id: c2f4947d2980995724c539de7c6d97618e1ba11a
2020-01-15 14:05:23 -08:00
81048c41ab remove simple .data from torch/nn
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31482

Test Plan: Imported from OSS

Differential Revision: D19303243

Pulled By: albanD

fbshipit-source-id: 5afdfeb4b8382c09b9ec65acd545148ed76d4285
2020-01-15 12:40:38 -08:00
3363ca20a7 example_outputs Doc Edit (#31826)
Summary:
The torch.onnx.export docs contained two descriptions for the 'example_outputs' arg, so this combines the information into a single description alongside the other parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31826

Differential Revision: D19274928

Pulled By: zou3519

fbshipit-source-id: cbcce0a79c51784c1d7aa8981aab8aac118ca9b4
2020-01-15 12:34:34 -08:00
3d01e3d16f Notify other threads before running callbacks (#31713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31713

- In case the callbacks are heavy/slow, the other threads should be able to start working on the value of the future once the current thread moves the value and unlocks the mutex.
- `completed()` is now inlined to avoid function call overhead.

ghstack-source-id: 96694593

Test Plan: tdb

Differential Revision: D5624371

fbshipit-source-id: 5762e6e894d20108ec9afedd1a6e64bcd97ee3fe
2020-01-15 12:03:07 -08:00
0392e8384b Fix simple typo: whos -> whose (#31288)
Summary:
Closes https://github.com/pytorch/pytorch/issues/31287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31288

Differential Revision: D19166753

Pulled By: zou3519

fbshipit-source-id: da31ad323b8fafa7cbc502fda4e2eb6e02facfb6
2020-01-15 11:47:21 -08:00
4314620ba0 [jit] Module clone work with shared ClassType (#31970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31970

Now that the ClassType can be shared among different module instances, we'll preserve the sharing in clone as well. That is, if the original module has a ClassType that is shared, we'll clone this ClassType once and share it between the different cloned module instances as well.

Test Plan:
build/test/test_jit

Imported from OSS

Differential Revision: D19406251

fbshipit-source-id: 2881c695f6e718e5432040a3817cf187a62017bf
2020-01-15 11:24:53 -08:00
62b06b9fae Rename TensorTypeId to DispatchKey (#32154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32154

TensorTypeId -> DispatchKey
	c10/core/TensorTypeId.h -> c10/core/DispatchKey.h
	c10/core/TensorTypeId.cpp -> c10/core/DispatchKey.cpp
	TensorTypeId::* -> DispatchKey::*
	TensorTypeId type_id -> DispatchKey dispatch_key
		type_id -> dispatch_key
	TensorTypeId::NumTensorIds -> DispatchKey::NumDispatchKeys
	RealTensorTypeId -> RealDispatchKey
TensorTypeSet -> DispatchKeySet
	TensorTypeIds -> DispatchKeys
	c10/core/TensorTypeSet.h -> c10/core/DispatchKeySet.h
	c10/core/TensorTypeSet.cpp -> c10/core/DispatchKeySet.cpp
	type_set() -> key_set()
	type_set_ -> key_set_
	typeSet -> keySet
ExcludeTensorTypeIdGuard -> ExcludeDispatchKeyGuard
IncludeTensorTypeIdGuard -> IncludeDispatchKeyGuard
LocalTensorTypeSet -> LocalDispatchKeySet
	c10/core/impl/LocalTensorTypeSet.h -> c10/core/impl/LocalDispatchKeySet.h
	c10/core/impl/LocalTensorTypeSet.cpp -> c10/core/impl/LocalDispatchKeySet.cpp
	tls_local_tensor_type_set -> tls_local_dispatch_key_set
	tls_is_tensor_type_id_excluded -> tls_is_dispatch_key_excluded
	tls_set_tensor_type_id_excluded -> tls_set_dispatch_key_excluded
	tls_is_tensor_type_id_included -> tls_is_dispatch_key_included
	tls_set_tensor_type_id_included -> tls_set_dispatch_key_included
MultiDispatchTensorTypeSet -> MultiDispatchKeySet
	multi_dispatch_tensor_type_set -> multi_dispatch_key_set
tensorTypeIdToBackend -> dispatchKeyToBackend
backendToTensorTypeId -> backendToDispatchKey
initForTensorTypeSet -> initForDispatchKeySet
inferred_type_set -> inferred_key_set
computeTensorTypeId -> computeDispatchKey
PODLocalTensorTypeSet raw_local_tensor_type_set -> PODLocalDispatchKeySet raw_local_dispatch_key_set
get_default_tensor_type_id -> get_default_dispatch_key
inferred_type_id -> inferred_dispatch_key
actual_type_id -> actual_dispatch_key
typeSetToDispatchKey_ -> dispatchKeySetToDispatchKey_
get_type_id() -> get_dispatch_key()
legacyExtractTypeId -> legacyExtractDispatchKey
extractTypeId -> extractDispatchKey

Test Plan: Imported from OSS

Differential Revision: D19398900

Pulled By: pbelevich

fbshipit-source-id: 234ad19f93d33e00201b61e153b740a339035776
2020-01-15 11:16:08 -08:00
8c3ee9f2ba [Python] Deprecate use of scipy.misc.logsumexp and scipy.misc.comb (#32209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32209

* Deprecate use of scipy.misc.logsumexp and scipy.misc.comb.
* Removed in 1.0.0: https://docs.scipy.org/doc/scipy-1.1.0/reference/generated/scipy.misc.logsumexp.html and https://docs.scipy.org/doc/scipy-1.2.1/reference/generated/scipy.misc.comb.html
* Use scipy.special.logsumexp and scipy.special.comb instead (see the sketch below).
* This diff updates most usages, except those in experimental folders.
* This diff does NOT fix existing lint/code/TARGETS issues.
* This diff does NOT autoformat code.
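The sketch mentioned above, showing the migration in code form:
```
# Old (removed from scipy.misc):
#   from scipy.misc import logsumexp, comb
# New:
from scipy.special import logsumexp, comb

print(comb(5, 2))             # 10.0
print(logsumexp([0.0, 0.0]))  # log(2) ~= 0.6931
```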

Test Plan: sandcastle auto unittests

Differential Revision: D19406460

fbshipit-source-id: 2103fa0d674d9671a0175f4ce54b3c887d22f04e
2020-01-15 10:40:47 -08:00
05088da8e9 [pytorch][PR] Fixed error in sample code of documentation (#31682)
Summary:
"in_features" and "out_features" are not defined. Possibly a typo. They should be "input_features" and "output_features" instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31682

Differential Revision: D19251685

Pulled By: zou3519

fbshipit-source-id: ac9e524e792a1853a16e8876d76b908495d8f35e
2020-01-15 10:34:07 -08:00
ef0f96e92f [pytorch][PR] update comment in autograd.h for locking (#32222)
Summary:
Just update the comment to make it accurate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32222

Differential Revision: D19410428

Pulled By: albanD

fbshipit-source-id: ad13596382613c2728e674a47049ea4f563964b9
2020-01-15 09:42:24 -08:00
19bbb4fccb Stop building documentation in pytorch_linux_xenial_cuda*_build (#32187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32187

Fixes #32058. Previously we would build documentation during the pytorch
linux cuda build. We don't actually need to do this because we have a
dedicated python_doc_build job that builds the docs. With this change,
the CUDA build should run ~10 minutes faster, giving devs faster signal.

Test Plan: - Check the CUDA (10.1) build on this PR, make sure it doesn't build the docs.

Differential Revision: D19400417

Pulled By: zou3519

fbshipit-source-id: e8fb2b818146f33330e06760377a9afbc18a71ed
2020-01-15 07:48:42 -08:00
4dce482acb dict type unification fix (#32185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32185

Previously we would unify the contained types of dictionaries; however, this breaks type safety.
```
@torch.jit.script
def test(input: Dict[str, None], cond: bool):
    if cond:
        out = input
    else:
        out = {"1": 1}
    out["hi"] = 3
```

This would only occur if a dictionary is being re-assigned across an if condition with different contained types, which is pretty unlikely. I tested `model_backward_compatibility` for all fb models and this didn't break anything. This PR is a precursor to alias analysis changes.

Also fixes `Future` type unification. Because `Future` is an immutable type, it is okay to unify the contained type.

Test Plan: Imported from OSS

Differential Revision: D19398585

Pulled By: eellison

fbshipit-source-id: ebc8812cdf5b6dba37b1cfbc2edc7d8c467b258c
2020-01-14 23:02:05 -08:00
c70bb0a4f8 Fixes to prim ops (#32179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32179

Tensors are used as keys in dictionaries, so we need to annotate that inserting a key into a dictionary inserts the key into the wildcard set. Also fixes a bug where `listCopyAndSort` did not copy the input list.

Test Plan: Imported from OSS

Differential Revision: D19397555

Pulled By: eellison

fbshipit-source-id: 17acdc22ff5e2dda44fd25c80450396f5592095e
2020-01-14 22:58:29 -08:00
879620e85e [caffe2] fix how np.clip is used in lengths_reducer_fused_{4,8}_rowwise_ops_test (#32086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32086

np.clip(1, num_indices // 2, 10) -> np.clip(num_indices // 2, 1, 10)
Also change batchsize -> num_rows to match what the variable actually does.
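For clarity, `np.clip(a, a_min, a_max)` clamps its first argument, so the value being clipped must come first:
```
import numpy as np

num_indices = 8
num_rows = np.clip(num_indices // 2, 1, 10)  # clamp num_indices // 2 into [1, 10]
print(num_rows)  # 4
```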

Test Plan: CI

Reviewed By: hx89

Differential Revision: D19361521

fbshipit-source-id: 9ce864c7d7da046dc606afa5207da677ccf80f52
2020-01-14 22:53:28 -08:00
7ad03855dc Fix 'template' keyword warning with clang-cl and clang.exe (#32104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32104

Fixes these warnings:
```
xplat\caffe2\caffe2Windows#header-mode-symlink-tree-only,headers\caffe2\operators\quantized\int8_conv_op.h(96,17): warning: use 'template' keyword to treat 'data' as a dependent template name
            W.t.data<uint8_t>(),
                ^
                template
xplat\caffe2\caffe2Windows#header-mode-symlink-tree-only,headers\caffe2\operators\quantized\int8_conv_op.h(97,17): warning: use 'template' keyword to treat 'data' as a dependent template name
            B.t.data<int32_t>(),
                ^
                template
```

Test Plan: Tested locally with clang-cl and CI for other toolchains

Reviewed By: boguscoder

Differential Revision: D19353563

fbshipit-source-id: c28afb8c1ad72fd77ef82556ba89fcf09100d1f9
2020-01-14 20:09:35 -08:00
02f09a1bbd Implement backend-agnostic rpc._wait_all_workers() utility (#32190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32190

We need a backend-agnostic mechanism to perform a barrier-like operation before locally destroying the RRef context and shutting down the RPC agent. The protocol (sketched in code below):

- Sort the worker names.
- Elect the first name in the ordered worker names as the leader.
- Followers report their intent to synchronize to the leader.
- The leader also reports to itself when `_wait_all_workers()` is called.
- Once all workers report their intent to proceed, the leader sends the command to everyone to proceed.
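A rough sketch of that protocol, using an in-process queue as a stand-in for the RPC agent (illustrative only, not the actual implementation):
```
import queue
import threading

inboxes = {}  # worker name -> message queue (stand-in for RPC)

def wait_all_workers(self_name, worker_names):
    leader = sorted(worker_names)[0]            # elect first sorted name
    inboxes[leader].put(("intent", self_name))  # everyone, incl. leader, reports
    if self_name == leader:
        reported = set()
        while reported != set(worker_names):    # gather all intents
            kind, sender = inboxes[leader].get()
            assert kind == "intent"
            reported.add(sender)
        for w in worker_names:                  # then release everyone
            inboxes[w].put(("proceed", leader))
    while True:                                 # block until released
        kind, _ = inboxes[self_name].get()
        if kind == "proceed":
            return

names = ["worker0", "worker1", "worker2"]
for n in names:
    inboxes[n] = queue.Queue()
threads = [threading.Thread(target=wait_all_workers, args=(n, names)) for n in names]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all workers released")
```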
ghstack-source-id: 96693296

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn

buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_worker_id
```

# Stress runs
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10
```

Differential Revision: D19399908

fbshipit-source-id: 1dee607cd49adafe88534621a1c85e2736e2f595
2020-01-14 19:19:14 -08:00
7572501d40 move ProcessGroupGlooTest to gtest (#32133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32133

We should do this to better debug the test.

Differential Revision: D19375479

fbshipit-source-id: 8c2bf61bae605a38252bb793b091ade479bea11a
2020-01-14 17:42:42 -08:00
8dc67a014f Add cummax
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32169

Differential Revision: D19393236

Pulled By: anjali411

fbshipit-source-id: 5dac6b0a4038eb48458d4a0b253418daeccbb6bc
2020-01-14 17:19:10 -08:00
02c3493a84 Fix an invalid peephole transformation if input/output values are written to (#28455)
Summary:
fixes https://github.com/pytorch/pytorch/issues/28360
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28455

Differential Revision: D19374601

Pulled By: Krovatkin

fbshipit-source-id: 622f24b40aba03e79e55a6b8d25d88417f7d8bad
2020-01-14 16:28:07 -08:00
2bd179147a Fix typo in config script to re-enable libtorch build and test in macOS CI (#32072)
Summary:
Currently, libtorch build and test are not running in macOS CI. This PR fixes the issue.

**Test Plan:**
Check that libtorch build and test are running again in macOS CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32072

Differential Revision: D19391909

Pulled By: yf225

fbshipit-source-id: 1ab345b099869f78e1124f1a8bd185fa51371b6a
2020-01-14 16:23:57 -08:00
f6f1e0aef5 Automatic update of fbcode/onnx to 65020daafa9183c769938b4512ce543fd5740f8f (#32125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32125

Previous import was 57ebc587fcf3913b4be93653b0dd58c686447298

Included changes:
- **[65020daa](https://github.com/onnx/onnx/commit/65020daa)**: better error message for undefined inputs (#2540) <Yuxin Wu>
- **[8afff0e9](https://github.com/onnx/onnx/commit/8afff0e9)**: bump ORT version (#2538) <Lu Fang>
- **[3d9ca57e](https://github.com/onnx/onnx/commit/3d9ca57e)**: fix name of directory (#2537) <Prasanth Pulavarthi>
- **[df8fa2c9](https://github.com/onnx/onnx/commit/df8fa2c9)**: Repository guidelines (#2539) <Prasanth Pulavarthi>
- **[49cc2f02](https://github.com/onnx/onnx/commit/49cc2f02)**: Update CircleCI job to use Python3.6 (#2527) <bddppq>
- **[25ff79a4](https://github.com/onnx/onnx/commit/25ff79a4)**: Fix wrong model version, it's not 12 (the onnx_opset_version()), not 11 (the opset version of the latest stable), but 10 (#2478) <daquexian>
- **[7cebaed5](https://github.com/onnx/onnx/commit/7cebaed5)**: Fix Windows py3.5 CI (#2529) <bddppq>
- **[eddae00e](https://github.com/onnx/onnx/commit/eddae00e)**: Correct the order of arguments of InferShapes (#2500) <Shinichiro Hamaji>
- **[41b5afe6](https://github.com/onnx/onnx/commit/41b5afe6)**: Include <ostream> in common/status.h (#2519) <Casey Carter>
- **[423f1977](https://github.com/onnx/onnx/commit/423f1977)**: add 8 bit support to maxpool op (#2510) <Ashwini Khade>
- **[78593c2f](https://github.com/onnx/onnx/commit/78593c2f)**: add 8 bit support to reducemin and reducemax ops (#2516) <Ashwini Khade>

Test Plan: cont build

Reviewed By: benoitsteiner

Differential Revision: D19380034

fbshipit-source-id: ddce8450864a611773b2a32e2f0254c9bb6b6906
2020-01-14 15:21:37 -08:00
f3b67bf750 Fix frontend kwarg defualts error (#32146)
Summary:
This was not tested before; fixes #32139 (which was actually a false positive: functions with kwargs but without defaults on those kwargs are supported). This PR adds testing for both cases and cleans up the error reporting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32146

Pulled By: driazati

Differential Revision: D19385828

fbshipit-source-id: 5eab74df6d02f8e1d7ec054cafb44f909f9d637e
2020-01-14 14:59:36 -08:00
ecc3497172 Update Gemfile (#32147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32147

### Summary

Got some security warnings regarding the ruby dependencies. This diff updates the packages in Gemfile.

```
GitHub has detected that a package defined in the ios/TestApp/Gemfile.lock file of the pytorch/pytorch repository contains a security vulnerability.

Package name: excon
Affected versions: < 0.71.0
Fixed in version: 0.71.0
Severity: LOW

Identifier(s):
GHSA-q58g-455p-8vw9
CVE-2019-16779
```

### Test Plan

- Won't affect the existing iOS CI jobs

Test Plan: Imported from OSS

Differential Revision: D19400087

Pulled By: xta0

fbshipit-source-id: 34b548d136cfd6b68fcc53bf0b243461bd7afd64
2020-01-14 14:52:50 -08:00
9bf0479b65 Fix the passing-by-ref constructor of OperatorName. (#32170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32170

Stack from [ghstack](https://github.com/ezyang/ghstack):
Change the overload name from being passed by const ref to being passed by value and moved.
* **#32170 Fix the passing-by-ref constructor of OperatorName.**

Test Plan: Imported from OSS

Differential Revision: D19396225

Pulled By: iseeyuan

fbshipit-source-id: e946c47647e1f8d23d7565cfe93f487845e7f24c
2020-01-14 13:52:12 -08:00
51a34545e9 Revert D18482934: support torch script call over rpc
Test Plan: revert-hammer

Differential Revision:
D18482934

Original commit changeset: bd82a0d820c4

fbshipit-source-id: ca5e50fb0a883ee311aeb310198d84ad28062158
2020-01-14 13:30:56 -08:00
4a26bb9b18 Suppress pip logs (#31912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31912

### Summary

Clean up the logs from pip-install.

### Test Plan

- Don't break the iOS simulator build

Test Plan: Imported from OSS

Differential Revision: D19395526

Pulled By: xta0

fbshipit-source-id: a638a209cab801ce90c8615e7ea030b1ab0939f3
2020-01-14 12:04:53 -08:00
2bb9dbeffa omit constexpr with nvcc on clang (#32149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32149

This is an attempt at clarifying some of the preprocessor boolean logic that was getting more and more complicated. The previous logic used constexpr with nvcc on clang, which caused compiler failures in ovrsource with mode/linux/* (based on platform007).

Test Plan:
ovrsource xplat/caffe2 compiles
fbsource sandcastle green

Differential Revision: D19385409

fbshipit-source-id: 60a02bae9854388b87510afdd927709673a6c313
2020-01-14 11:49:16 -08:00
b0ac425dc4 Emit warning from deprecated torch function signatures (#32009)
Summary:
Continuation of https://github.com/pytorch/pytorch/issues/31514, fixes https://github.com/pytorch/pytorch/issues/28430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32009

Test Plan:
I verified that the deprecation warnings only occur once on a relevant workflow. Built with:

```
buck build mode/opt //vision/fair/detectron2/tools:train_net
```

Ran with:

```
DETECTRON2_ENV_MODULE=detectron2.fb.env ~/local/train_net.par --config-file configs/quick_schedules/retinanet_R_50_FPN_instant_test.yaml --num-gpus 1 SOLVER.IMS_PER_BATCH 2
```

Inspected log:

```
[01/14 07:28:13 d2.engine.train_loop]: Starting training from iteration 0
buck-out/opt/gen/caffe2/generate-code=python_variable_methods.cpp/python_variable_methods.cpp:1299: UserWarning: This overload of add is deprecated:
add(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add(Tensor other, Number alpha)
buck-out/opt/gen/caffe2/generate-code=python_variable_methods.cpp/python_variable_methods.cpp:1334: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, Number alpha)
[01/14 07:28:25 d2.utils.events]: eta: 0:00:10  iter: 19  total_loss: 1.699  loss_cls: 1.185  loss_box_reg: 0.501  time: 0.5020  data_time: 0.0224  lr: 0.000100  max_mem: 3722M
[01/14 07:28:35 fvcore.common.checkpoint]: Saving checkpoint to ./output/model_final.pth
```
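For reference, the two call forms from the warning (the deprecated overload computes `self + alpha * other`; it may be removed entirely in later releases):
```
import torch

t = torch.ones(3)
other = torch.full((3,), 2.0)

new = t.add(other, alpha=2)  # preferred: add(Tensor other, Number alpha)
# Deprecated overload from the warning: add(Number alpha, Tensor other)
# old = t.add(2, other)      # would emit the UserWarning shown above
print(new)                   # tensor([5., 5., 5.])
```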

Differential Revision: D19373523

Pulled By: ezyang

fbshipit-source-id: 75756de129645501f43ecc4e3bf8cc0f78c40b90
2020-01-14 11:44:29 -08:00
61e509b992 Skip un-runnable tests (#31965)
Summary:
`test_init_ops` calls `orthogonal_`, which fails without LAPACK (this test was just missing a skip condition).

The cpp tests would fail with an `undefined symbol` error if run with `BUILD_TESTS=0`, so this PR skips them if that flag is `0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31965

Pulled By: driazati

Differential Revision: D19320064

fbshipit-source-id: d1dcd36714107688ded25a414e8969abe026bd03
2020-01-14 11:36:52 -08:00
0664c6bbfd Add ccls cache to gitignore (#31437)
Summary:
`ccls` [puts a cache](https://github.com/MaskRay/ccls/wiki/Customization#cachedirectory) in the working directory by default; this PR adds it to .gitignore so git doesn't pick it up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31437

Pulled By: driazati

Differential Revision: D19165007

fbshipit-source-id: 41012eb0ece2df60b8566d7929710b154c38ee66
2020-01-14 11:27:18 -08:00
b783a75aa3 Fix scalar^tensor derivative for scalars that are zero
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32063

Test Plan: Imported from OSS

Differential Revision: D19394258

Pulled By: agolynski

fbshipit-source-id: 3eed0f9cc1b8c677c6948c927d007044be67fe7f
2020-01-14 11:11:23 -08:00
fa60e1150d Fix tensor^tensor derivative for 0 base entries
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32062

Test Plan: Imported from OSS

Differential Revision: D19394259

Pulled By: agolynski

fbshipit-source-id: 836525e03573af838511ad5b4cc87ec2c1536a5e
2020-01-14 11:10:25 -08:00
1487582ba7 Switch important CI from CUDA 9 to 10.1 (#31951)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31427
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31951

Differential Revision: D19393566

Pulled By: ezyang

fbshipit-source-id: 06f9637791494a453d3fbef765840dc9f9805196
2020-01-14 09:38:55 -08:00
dbd737158b support torch script call over rpc (#30063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30063

This diff makes the following changes:
1. Provides a new set of private Python RPC APIs that can accept an annotated TorchScript call, which can then be serialized, deserialized, and executed in C++ without the GIL. These private APIs will be bound to JIT in the future, and they differ from the public APIs in that the future JIT-bound private APIs will accept a qualified_name rather than callables. These private APIs are subject to deprecation once JIT supports a torch script function as a JIT type.

Also, these APIs require the torch script function to be defined and annotated by users in Python land; it cannot be a script class/module constructor or a class/module method.

2. This diff also allows the public RPC APIs to accept an annotated TorchScript call and execute the code path that the above private APIs run on. Therefore, if users invoke an annotated TorchScript call over RPC, this call can be serialized, deserialized, and executed in C++ without the GIL as well.

3. The above private APIs call a newly defined C++ function to have the RPC torch script call serialized, deserialized, and executed in C++ land. This C++ function returns an ivalue::Future, so that in a follow-up diff it can be called when these private APIs are bound to JIT.

4. The script_call.cpp/.h and request_callback_impl.cpp files are refactored accordingly so that torch script calls and builtin calls can share the same message type and code.

5. Refactored deserializeResponse() and added a new utility to deserialize the response to an IValue.

ghstack-source-id: 96638829

Test Plan: unit test

Differential Revision: D18482934

fbshipit-source-id: bd82a0d820c47a8e45b2e7c616eca06573f7d7ea
2020-01-14 09:27:04 -08:00
5f1a881cb8 Add private user tensor type IDs for experimentation. (#31830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31830

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19330312

Pulled By: ezyang

fbshipit-source-id: fe2e53e732e946088e983ec45fed2393436f0517
2020-01-14 09:01:03 -08:00
8d472bab6b Make torch.backends.mkldnn usable without import
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32055

Differential Revision: D19373220

Pulled By: ezyang

fbshipit-source-id: 50ab3ff70fc893c81123419c4d3cf2e3e48a0a93
2020-01-14 08:19:19 -08:00
77c78b7d28 remove .data from torch/nn doc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31481

Test Plan: Imported from OSS

Differential Revision: D19303242

Pulled By: albanD

fbshipit-source-id: 4f650df9e9e302a299175967bcc6e30a5099fa2a
2020-01-14 07:30:42 -08:00
c036fbdc5c remove .data from torch/jit
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31480

Test Plan: Imported from OSS

Differential Revision: D19303244

Pulled By: albanD

fbshipit-source-id: ec66b32353f2f9b16072185ecde3ae8abbe09a35
2020-01-14 07:30:37 -08:00
26621d101f remove simple .data from torch/nn
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31482

Test Plan: Imported from OSS

Differential Revision: D19303185

Pulled By: albanD

fbshipit-source-id: 610eae096bab24a7b9f651b9af2e3ecd19df55b0
2020-01-14 07:29:24 -08:00
62b1a5f846 Updating submodules
Summary:
GitHub commits:

2156e48924
8c5b4af317
be69716784
4f76ad1fab
0b12b2f13c
0449b53cb1
1481689822
43ffa9bbf0
787d6b6c93

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: b0080fd1a4c26efbe8f26245fbba7740fbac08f3
2020-01-13 20:15:38 -08:00
a472f0201f Added support for Dim operation in ONNX export (#31928)
Summary:
While ONNX does not currently directly support the Dim operation on a tensor, we can provide the same functionality with two ONNX operations (see the sketch below). This allows us to support Dim for all opsets. It may be advantageous to add support for Dim to a future ONNX opset and use that for more efficient code.

While testing the dim op, we found an issue with empty blocks within if statements, and modified graph generation to prevent generation of empty if blocks.
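The message does not name the two ops; one natural encoding (an assumption for illustration, mimicked here in NumPy) is rank(x) = Size(Shape(x)):
```
import numpy as np

x = np.zeros((2, 3, 4))
shape = np.asarray(x.shape)  # ONNX Shape: 1-D tensor of dims -> [2, 3, 4]
rank = shape.size            # ONNX Size: element count of that tensor -> 3
assert rank == x.ndim
```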

Fixes https://github.com/pytorch/pytorch/issues/27569
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31928

Reviewed By: hl475

Differential Revision: D19376602

Pulled By: houseroad

fbshipit-source-id: 111682b058a5341f5cca6c1a950c83ae412a4c6c
2020-01-13 19:42:43 -08:00
c474952b5d Updating submodules
Summary:
GitHub commits:

1f8321394d
024c1d0b43
1d57089fc3
3c6f1f782c
21a27b0f8e
23bb716b62
894c6d21af
e3e241d700
ac4e11d84a
c35803ad68
647388f265
50a3288630
b197f0c95a

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 1807ac876a126d221c257edbd4732f9a1240e869
2020-01-13 18:07:08 -08:00
470c496eb2 use cholesky_inverse to compute precision matrix (#32092)
Summary:
Resolves a long-standing TODO. :D

I also fix the docs of lowrank_mvn, an issue raised on the [forum](https://discuss.pytorch.org/t/lowrankmultivariatenormal-example-raises-valueerror/65381).

cc vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32092

Differential Revision: D19373912

Pulled By: ezyang

fbshipit-source-id: b13129d7c30e87c6f8a6ced86601762a3f5c5624
2020-01-13 16:35:46 -08:00
f003008d6e Allow TCPStore to pick a port to bind to. (#31674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31674

The motivation of this PR was to fix the problem where we would see
"Address already in use" issues for TCPStoreTest due to port conflicts. To
resolve this:

1. We can now pass in port 0 for TCPStore and retrieve the port it actually
bound to using a new getPort() API.
2. Added a `wait` flag to TCPStore constructor indicating whether or not it
should wait for workers (defaults to true).
3. Made `waitForWorkers` a public API to ensure that we can construct TCPStore
without waiting and wait for workers separately. This helps in TCPStoreTest to
ensure we can retrieve the port and pass it to the client stores.
ghstack-source-id: 96486845

Test Plan: waitforbuildbot

Differential Revision: D19240947

fbshipit-source-id: 7b1d1cb2730209fac788764845f1dbbe73d75d9b
2020-01-13 14:23:31 -08:00
632d6fc583 Revert D19373615: Fix typo in config script to re-enable libtorch build and test in macOS CI
Test Plan: revert-hammer

Differential Revision:
D19373615

Original commit changeset: 28686ef58953

fbshipit-source-id: 432b04adfd9d010e1965846a386f117ebc80e013
2020-01-13 14:11:30 -08:00
701ca68882 Docs entry for the is_quantized
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32075

Test Plan: Imported from OSS

Differential Revision: D19353861

Pulled By: z-a-f

fbshipit-source-id: 4249216ac9a4af354a251c62181d65bc14cbfd3e
2020-01-13 13:54:35 -08:00
d53ce5e4cd Updating submodules
Summary:
GitHub commits:

b5718e35c8
e1af1b0550
8a34e7f444
e9e70ade5b
d9e693ece0
329347c63c
671b5aa064
7f3bb0bf37
6207e92b9b
d4b95d87d4

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 3c9131bdee0bf8a8ca5c679a95e8ff8a6f805762
2020-01-13 13:30:11 -08:00
d97413eb7a Change python/cpp docs CI to use a CPU-only image (#32102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32102

Previously, the docs CI depended on our CUDA xenial py3 build. This
meant that the turnaround time to get signal for docs was very slow
(I've seen builds that go as much as 3 hours).

Fortunately, the docs CI does not (and should not!) rely on CUDA. This
PR changes it so that the docs CI runs on a CPU-only machine.

Fixes #29995

Test Plan:
- Check CI status on this PR by reading logs for the python and cpp docs
builds.
- I built the docs locally, once for CPU, and once for CUDA, and
verified (via diff) that the pages were exactly the same.

Differential Revision: D19374078

Pulled By: zou3519

fbshipit-source-id: 3eb36f692c3c0632d2543d3439c822d51a87b809
2020-01-13 12:01:49 -08:00
1f34801460 More robust mangling (#31978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31978

Currently we keep a `mangleIndex_` that's internal to the compilation unit and
just increment the index when we find that the original name is mangled; this doesn't
guarantee that the new name is not already defined.
This PR fixes the problem by querying whether the new name is defined before using it.
fixes: https://github.com/pytorch/pytorch/issues/31268
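
A hedged sketch of the idea in Python (the real fix lives in the C++ compilation unit; the name format here is only illustrative):

```
def mangle(base_name, is_defined, index=0):
    # Keep bumping the mangle index until the candidate name is actually
    # unused, instead of trusting the counter alone.
    while True:
        candidate = "{}.___torch_mangle_{}".format(base_name, index)
        if not is_defined(candidate):
            return candidate
        index += 1

taken = {"MyModule.___torch_mangle_0"}
print(mangle("MyModule", taken.__contains__))  # MyModule.___torch_mangle_1
```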

Test Plan:
fixes the issue

Imported from OSS

Differential Revision: D19350535

fbshipit-source-id: fe3262b2838d4208ab72e2cd4a5970b3a792ae86
2020-01-13 11:11:50 -08:00
a3dd44653f Fix typo in config script to re-enable libtorch build and test in macOS CI (#32072)
Summary:
Currently, libtorch build and test are not running in macOS CI. This PR fixes the issue.

**Test Plan:**
Check that libtorch build and test are running again in macOS CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32072

Differential Revision: D19373615

Pulled By: yf225

fbshipit-source-id: 28686ef5895358a2b60db46b1946f21c58c6a18e
2020-01-13 10:25:10 -08:00
5988d36f58 Fix cumprod error for tensors with zero elements (#32070)
Summary:
Currently, cumprod's backward crashes for tensors that have dimensions but zero elements, which happens when some dimension has size zero. This commit fixes the error by checking both dim() and numel() in cumprod backward.
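
A minimal repro sketch of the edge case (shape chosen for illustration):

```
import torch

# A tensor with dimensions but zero elements: dim() == 2, numel() == 0.
x = torch.randn(2, 0, requires_grad=True)
y = x.cumprod(dim=1)
y.sum().backward()  # previously crashed in cumprod backward
```
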
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32070

Differential Revision: D19373200

Pulled By: ezyang

fbshipit-source-id: d8ecde33f3330b40a7c611f6faa3b1d707ef2a9a
2020-01-13 09:50:27 -08:00
695c4f1bab Fix a typo in function name: liner -> linear
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32068

Test Plan: Imported from OSS

Differential Revision: D19373360

Pulled By: nairbv

fbshipit-source-id: 7696300b5c1dbcd7991fda3311d68807b2960982
2020-01-13 09:33:50 -08:00
8e93159fb6 CUDA 8 cleanup (#32013)
Summary:
CUDA 8 is no longer supported
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32013

Differential Revision: D19372963

Pulled By: ezyang

fbshipit-source-id: e584d7d5d5908933221ea4400234b3e6e7c32e7a
2020-01-13 08:48:48 -08:00
9a4219eb39 Install complete set of headers for ROCm build (#32076)
Summary:
This PR adds a more complete list of PyTorch header files to be installed at build time. It also fixes one instance of including a header from the local src directory instead of the installed directory.
A more complete set of headers enables other modules to work correctly with PyTorch built for ROCm.

cc: ezyang bddppq iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32076

Differential Revision: D19372933

Pulled By: ezyang

fbshipit-source-id: 3b5f3241c001fa05ea448c359a706ce9a8214aa0
2020-01-13 08:33:28 -08:00
4002fec509 Display NVCC version in CI for convenience to look at
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32069

Differential Revision: D19372943

Pulled By: ezyang

fbshipit-source-id: c78e5779d4139e42df1f235db65d8c0399ffa1a2
2020-01-13 08:16:52 -08:00
e74a215ade Changed clip_grad_norm_ total_norm calculation (#32020)
Summary:
Redefines the computation of the total_norm to increase performance as shown in https://github.com/pytorch/pytorch/issues/31474.
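
A hedged sketch of the general idea (not necessarily the exact diff): take one norm per gradient, then a single norm over the stacked results, instead of accumulating powered sums in a Python loop:

```
import torch

def total_norm(parameters, norm_type=2.0):
    grads = [p.grad.detach() for p in parameters if p.grad is not None]
    per_param_norms = torch.stack([torch.norm(g, norm_type) for g in grads])
    # The p-norm of the per-parameter p-norms equals the p-norm over
    # all gradient entries taken together.
    return torch.norm(per_param_norms, norm_type)
```
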
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32020

Differential Revision: D19353309

Pulled By: ngimel

fbshipit-source-id: bf7530dcd39f56614a211b5f21445864d4f2e875
2020-01-13 08:13:46 -08:00
77c2c78e01 Fix typographical error in torch.triu docstring (#32067)
Summary:
below --> above

Fixes https://github.com/pytorch/pytorch/issues/32032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32067

Differential Revision: D19355788

Pulled By: zou3519

fbshipit-source-id: dc7a2538a78cd11e72d47ad923ef50599a5a87e2
2020-01-13 07:21:33 -08:00
14593f077f remove list specialization from ivalue (#30734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30734

What are specialized lists?

The IValues that hold List[int], List[Tensor], and List[AnythingElse] are different C++ types.
e.g. List[int] has a std::vector<int> while List[AnythingElse] holds a std::vector<IValue>.

Why do we have specialized lists?

When we first created the JIT we needed to bind the ATen C++ API which has std::vector<int>,
std::vector<Tensor> as inputs. The easiest way to match this API was to make our IValues contain
these same types. Conversion was just unwrapping the IValue, very easy and cheap.

What is the problem with specialized lists?

We end up with significant special-casing throughout the compiler. Other types like Dict are not
specialized, so in the Pickler, for instance, there is a single piece of logic to handle
their serialization, while for Lists we end up with multiple cases. Furthermore, it doesn't
match Python, leading to problems along translation boundaries. Our pickle serialization
is slightly different from Python's, so it is harder to load objects from our IValue serialization
as Python values.

They also make it harder to provide an easy-to-use user API. We'd like to match pybind11 for C++
bindings to TorchScript. This would entail having a single torch::List class (untemplated)
that can be used to construct inputs. This is made much harder if the underlying ivalue needs
to be different depending on the type inside the list. The ideal case would be to have a constructor like

```
template<typename T>
List(std::vector<T> foo);
```

It would then set up the type tags correctly based on type T, without the need for passing tags.

Do specialized lists improve perf?

Not in a way we have been able to measure. Our major concern initially was having to translate
a std::vector<IValue> to std::vector<int> to call ATen functions. This was especially a concern
for aten::_convolution which takes a number of mostly-constant lists of integers. However,
when we measure the effect of actually having to do this conversion for an aten::_convolution,
it does not take measurable time (benchmark results below).
This is true even if you use a trivial convolution (e.g. 1x1x1), and comment out the actual convolution code.

What are the issues removing them?

This PR removes list specialization but keeps the serialization format and IValue APIs almost exactly
the same. The only visible change is that toTensorListRef and family have turned into toTensorVector
because they now return by value a copy of the list as a vector.

Further PRs can then clean up the complexity issues that arose from specialization. This will likely
involve removing the isTensorList/isIntList functions, and refactoring the code that used them to
work generically. At some point we will also change serialization to no longer write specialized
lists in the pickle binary. This is forward incompatible, so will go in its own PR.

Benchmark:
```
import torch

import torch.nn as nn
import torch.nn.functional as F
import time

class MnistNet(nn.Module):
    def __init__(self):
        super(MnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, kernel_size=1)
        self.conv2 = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, x):
        for i in range(10):
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
        return x

model = MnistNet()
x = torch.rand(1, 1, 1, 1)
r = torch.jit.trace(model, x)
r(x)
r(x)
r(x)
r(x)
print(torch.jit.last_executed_optimized_graph())

while True:
    b = time.time()
    for i in range(100):
        r(x)
    e = time.time()
    print(e - b)
```

Results (no observable difference):

```
Before (actual conv)
0.13251137733459473
0.13260436058044434
0.13276338577270508
0.1327497959136963
0.13250041007995605
0.13270330429077148
0.13290190696716309
0.13265132904052734
0.13274288177490234
0.1326758861541748
0.13253355026245117
0.13254785537719727
0.13260746002197266
0.13285017013549805
0.13264012336730957
0.132490873336792
0.13280034065246582
0.13243484497070312
0.1325232982635498
0.1326127052307129
0.13264131546020508
0.13274383544921875
0.13298296928405762
0.1326909065246582
-------------------
After (actual conv)
0.13127517700195312
0.13150334358215332
0.13092470169067383
0.13102364540100098
0.13134360313415527
0.13155555725097656
0.13314104080200195
0.13151955604553223
0.13160037994384766
0.1315293312072754
0.13137340545654297
0.13148093223571777
0.131455659866333
0.1327371597290039
0.13134026527404785
0.13152337074279785
0.13151192665100098
0.13165974617004395
0.13403725624084473
0.13251852989196777
0.13135504722595215
0.1315624713897705
0.1317615509033203
0.1314380168914795
0.13157200813293457
--------------------

The following replace the convolution operator with a no-op, to show
that even if the conv op was made faster, then we still would not see
a difference:

Before (fake conv)
0.0069539546966552734
0.0069522857666015625
0.007120847702026367
0.007344722747802734
0.007689952850341797
0.007932662963867188
0.00761723518371582
0.007501363754272461
0.007532835006713867
0.007141828536987305
0.007174253463745117
0.007114410400390625
0.007071495056152344
------------------
After (fake conv)
0.007458209991455078
0.007337093353271484
0.007268190383911133
0.007313251495361328
0.007306575775146484
0.007468700408935547
0.0073091983795166016
0.007308483123779297
0.007538318634033203
0.007356882095336914
0.007464170455932617
0.007372140884399414
```

Test Plan: Imported from OSS

Differential Revision: D18814702

Pulled By: zdevito

fbshipit-source-id: 0371c73b63068fdc12f24b801371ea90f23531a6
2020-01-12 18:28:25 -08:00
46f32e136a Revert "Support PyTorch ROCm CI on Ubuntu18.04 (#31886)" (#31946)
Summary:
This reverts commit 4ee9c562188ae930cb2520cfce7805f55acaf968.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31946

Differential Revision: D19368391

Pulled By: bddppq

fbshipit-source-id: 63d032a5256ff4da7247fb1092be314c5b133eb6
2020-01-12 14:04:38 -08:00
927c2a02b0 enable autograd profiler to work with RPC and RRef. (#31381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31381

This PR adds support for being able to profile both sync and async RPCs, so that users can use the autograd profiler and be able to view metrics such as RPC latency and number of calls in the profiler output.

The way this is implemented is by using the existing `RecordFunction` class provided by the autograd profiler. We create a `RecordFunction` instance when sending an RPC, if autograd profiling is enabled. We also invoke the starting callbacks on this `RecordFunction` instance, this does things such as start the CPU timer.  This instance is then persisted across the lifetime of the RPC by attaching it to the `Future` created by the RPC. When the RPC is finished (i.e. when `future->markComplete()` is called), we run the `RecordFunction` instance's end callbacks, which among other things, stops the timer so that we get the correct RPC latency.

The `RecordFunction` and relevant callbacks in `profiler.cpp` are modified slightly to support running end callbacks from a different thread (which is needed since futures are marked as completed by a different thread than the main RPC thread). By default, the autograd profiler uses a `thread_local` list of `Events` and `thread_id`. However, since we'd like to run the `RecordFunction`'s callbacks from a different thread, we would like to access the list of `Events` created by the original thread. This is done by attaching the `thread_id` for the event to the `RecordFunction`, and then looking up the event with that thread in `all_event_lists` (see the changes in `profiler.cpp`). To ensure that the original behavior does not change in the profiler, this described behavior is only run when a user calls `setOverrideThreadId()` on the `RecordFunction` object.
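
A hypothetical usage sketch (assumes `rpc.init_rpc` has been called and a peer named "worker1" exists):

```
import torch
import torch.distributed.rpc as rpc

with torch.autograd.profiler.profile() as prof:
    rpc.rpc_sync("worker1", torch.add,
                 args=(torch.ones(2), torch.ones(2)))

# RPC latency and call counts now show up among the profiler events.
print(prof.key_averages().table())
```
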
ghstack-source-id: 96527291

Test Plan: Added a unit test.

Differential Revision: D19053322

fbshipit-source-id: 9a27a60c809fc4fdb16fa5d85085f3b6b21abfbb
2020-01-10 21:26:18 -08:00
20e5c90d82 accept url query when rank or wolrd_size is specified (#32016)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32016

The previous logic raised an exception when the URL contained a query and rank or world_size was specified.
The fix parses the URL, stitches rank and world_size into url.query, and regenerates the URL.
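
A hedged sketch of the fix using only the standard library (the function name is made up):

```
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def merge_rank_world_size(url, rank, world_size):
    # Parse the init URL, stitch rank/world_size into any existing
    # query, and regenerate the URL.
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["rank"] = [str(rank)]
    query["world_size"] = [str(world_size)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# merge_rank_world_size("tcp://host:23456?timeout=60", 0, 2)
# -> "tcp://host:23456?timeout=60&rank=0&world_size=2"
```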

Test Plan: f161291877

Differential Revision: D19337929

fbshipit-source-id: 6bb3a07716dda5233553804000b706052ff18db8
2020-01-10 18:27:06 -08:00
b6cee03e29 C++ tensor indexing: add Slice / TensorIndex (#30424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30424

`at::indexing::TensorIndex` is used for converting C++ tensor indices such as `{None, "...", Ellipsis, 0, true, {1, None, 2}, torch::tensor({1, 2})}` into its equivalent `std::vector<TensorIndex>`, so that further tensor indexing operations can be performed using the supplied indices.
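
For reference, a small snippet showing the Python-side forms that these C++ indices mirror (shapes chosen for illustration):

```
import torch

x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
# None -> unsqueeze, "..."/Ellipsis -> ..., 0 -> select,
# {1, None, 2} -> slice(1, None, 2), torch::tensor({1, 2}) -> tensor index.
a = x[None, ..., 0]                   # shape (1, 2, 3)
b = x[:, 1::2, torch.tensor([1, 2])]  # shape (2, 1, 2)
```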

Test Plan: Imported from OSS

Differential Revision: D18695902

Pulled By: yf225

fbshipit-source-id: d73e14a411cdbec815866b02e75ffd71a9186e89
2020-01-10 17:53:41 -08:00
638e4ad8b9 Updated function definition for torch.mode and torch.median in torch docs (#32003)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/32002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32003

Differential Revision: D19334306

Pulled By: anjali411

fbshipit-source-id: fe6a7cc7295b2d582a0b528f353ec64d9085e8c5
2020-01-10 13:13:54 -08:00
28c1258f18 Scale init for batch-norm and layer-norm (#31983)
Summary:
Per discussion with Fei Tian, we need to add a `scale_init_value` to scale down the output of normalization such as batch-norm and layer-norm.

Currently we have `sparse_normalization_options` to normalize the embedding pooling output. By default scale = 1.0; we found it's better to set the scale between 0.025 and 0.1: https://fb.quip.com/MiKUAibEaYhH

Besides, I am removing the tags from the normalizers because it makes more sense to compute the norm ops in the distributed trainers, not on the PS.
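
A hedged PyTorch-flavored sketch of the same idea (the diff itself is in the Caffe2/DPER layers; names here are illustrative):

```
import torch.nn as nn

ln = nn.LayerNorm(64)
# Initialize the normalization scale (gamma) to a small value instead of
# the default 1.0, e.g. somewhere in the suggested 0.025-0.1 range.
nn.init.constant_(ln.weight, 0.05)
```
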
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31983

Test Plan:
Testing LN and BN after sum-pooling --
baseline f160348514
LN: f160348609
BN: f160348710

{F226106518}

Layer norm after sum-pooling fwd_net https://fburl.com/sa4j207n
Layer norm after dot-prod fwd_net https://fburl.com/twggwyvb

## Unit Tests
Testing normalization after pooling
```
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_sparse_pooling_batch_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_dense_sparse_pooling_batch_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_sparse_pooling_layer_normalization
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_4 -- test_dense_sparse_pooling_layer_normalization
```

Testing normalization after dot-prod
```
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_last_layer_use_batch_norm
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_last_layer_use_layer_norm
```

Differential Revision: D19277618

Pulled By: SilunWang

fbshipit-source-id: ea323e33e3647ba55d2e808ef09d94ad7b45b934
2020-01-10 11:55:56 -08:00
c5af0afdcb catch exceptions in ProcessGroupAgent::enqueueSend and report them. (#31023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31023

Adds support for catching exceptions in ProcessGroupAgent::enqueueSend and
reporting them through the returned future by marking it as completed with an
exception indicating the error. An example of when this could happen is if the
receiving side aborts while the sender is sending the message; previously, we
would hang until the timeout was hit, and the original exception would be lost.
ghstack-source-id: 96498386

Test Plan: Added a relevant unit test: `test_sender_exceptions` in rpc_test.py

Differential Revision: D18901981

fbshipit-source-id: 08de26936c4ad45b837219a247088cbea644c04c
2020-01-10 11:39:57 -08:00
346005d3ed integrate op dependency analysis process into CI
Summary:
Custom build and internal build will depend on the analysis result so
let's make sure it doesn't break.

Tested locally with LLVM-5.0, LLVM-7 and LLVM-8.

Test Plan: - check CI result

Differential Revision: D18894637

Pulled By: ljk53

fbshipit-source-id: 657854e4bed85a84907e3b6638d158823a56ec80
2020-01-10 11:37:37 -08:00
16b8ca56b6 update docker image version (#31848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31848

Trigger docker image build and bump up docker image version.

Test Plan: - Check tag at: http://ossci-docker.s3-website.us-east-1.amazonaws.com/pytorch.html

Differential Revision: D19282725

Pulled By: ljk53

fbshipit-source-id: a27b2831a92ff54d80ccbae0f18dadff0469254c
2020-01-10 11:37:32 -08:00
03ff3eb94d skip TEST_DILL on Python2 (#32027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32027

The test was added in #30985 for #28313. It seems the fix only works on
Python 3 but not on Python 2. The current Python 2 CI docker image
doesn't have the `dill` module installed at all, so the regression wasn't caught.

I'm trying to build and push a new CI docker image that has `dill` installed
(I verified it's the latest version, 0.3.1.1), but the fix doesn't seem
to work and blocks me from upgrading the image version. It does work in the
Python 3 docker image, though...

Here is a succeeded job with old image (no dill installed):
https://app.circleci.com/jobs/github/pytorch/pytorch/4192688

Here is a failed job with new image (dill installed):
https://app.circleci.com/jobs/github/pytorch/pytorch/4192679

This PR bypasses the test on Py2 to unblock the docker image change. We
can figure out a proper fix for Py2 later.
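
A hedged sketch of the gating (module and test names are illustrative):

```
import sys
import unittest

HAS_DILL = False
if sys.version_info[0] == 3:
    try:
        import dill
        HAS_DILL = True
    except ImportError:
        pass

@unittest.skipIf(not HAS_DILL, "dill missing or running on Python 2")
class TestDillSerialization(unittest.TestCase):
    def test_roundtrip(self):
        self.assertEqual(dill.loads(dill.dumps([1, 2])), [1, 2])
```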

Test Plan: Imported from OSS

Differential Revision: D19341451

Pulled By: ljk53

fbshipit-source-id: d5768de8cbaf1beba8911da76f4942b8f210f2d2
2020-01-10 11:37:28 -08:00
ab5eb65e74 gate torch_global_deps with BUILD_SHARED_LIBS flag (#32011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32011

Ran into a build problem with the Ninja + code analysis build, as follows:
```
The install of the torch_global_deps target requires changing an RPATH from
the build tree, but this is not supported with the Ninja generator unless
on an ELF-based platform.
```

It seems we don't need to build this target in static build mode.

Verified that the code analyzer works with the patch.

Test Plan: Imported from OSS

Differential Revision: D19336818

Pulled By: ljk53

fbshipit-source-id: 37f45a9392c45ce92c1df40d739b23954e50a13a
2020-01-10 11:37:24 -08:00
f995ec2076 Remove qconfig_dict in top level eager mode quantization API (#31972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31972

Since eager mode quantization requires many user modifications, we can't
consistently quantize a given model by just changing qconfig_dict; therefore,
the top-level `qconfig_dict` is not that useful.
fixes: https://github.com/pytorch/pytorch/issues/31549

Test Plan:
.

Imported from OSS

Differential Revision: D19330691

fbshipit-source-id: 8aee6e5249e0c14e8a363ac1a83836e88887cd7d
2020-01-10 11:04:37 -08:00
c5a362a96d Updating submodules
Summary:
GitHub commits:

b14a430062
c1c5426018
42d18a93c4
a4e11e8721
25c971b0c3
b2ea65322f
e86573b6de
31d721301c
687119aeaf
25cad9547d
428862c045
95640f80d8
0e4db05b37
5cb83de9cc
4fdb800074

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: bcd533c540c1170844dbf2b23538d72c95a0d304
2020-01-10 11:01:20 -08:00
8098ae455c Move rshift to Aten (#31594)
Summary:
VitalyFedyunin, this PR moves rshift to ATen.
Benchmark script:
```
import timeit
import torch
torch.manual_seed(1)

for n, t in [(10, 100000),(1000, 10000)]:
    print('__rshift__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a >> b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}")', number=t))
        for dtype in ('torch.float32', 'torch.float64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a >> b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randn({n}, dtype = {dtype}, device="{device}"); b = torch.randn({n}, dtype = {dtype}, device="{device}")', number=t))

for n, t in [(10, 100000),(1000, 10000)]:
    print('__irshift__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a >> b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.tensor(5, dtype = {dtype}, device="{device}")', number=t))
        for dtype in ('torch.float32', 'torch.float64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a >> b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randn({n}, dtype = {dtype}, device="{device}"); b = torch.tensor(5, dtype = {dtype}, device="{device}")', number=t))
```
Device: **Tesla P100, skx-8180**
CUDA version: **9.0.176**

Before:
```
__rshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.17183916084468365
device: cpu, dtype: torch.uint8, 100000 times           0.16587729007005692
device: cpu, dtype: torch.int16, 100000 times           0.16659130714833736
device: cpu, dtype: torch.int32, 100000 times           0.17177579551935196
device: cpu, dtype: torch.int64, 100000 times           0.17860156949609518
device: cpu, dtype: torch.float32, 100000 times         0.23938780091702938
device: cpu, dtype: torch.float64, 100000 times         0.22591270506381989
device: cuda, dtype: torch.int8, 100000 times           1.2709560776129365
device: cuda, dtype: torch.uint8, 100000 times          1.2692269310355186
device: cuda, dtype: torch.int16, 100000 times          1.2785452520474792
device: cuda, dtype: torch.int32, 100000 times          1.2733035255223513
device: cuda, dtype: torch.int64, 100000 times          1.2785427365452051
device: cuda, dtype: torch.float32, 100000 times                1.2980637094005942
device: cuda, dtype: torch.float64, 100000 times                1.3062487514689565
__rshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.03122080024331808
device: cpu, dtype: torch.uint8, 10000 times            0.030290847644209862
device: cpu, dtype: torch.int16, 10000 times            0.024531075730919838
device: cpu, dtype: torch.int32, 10000 times            0.024743229150772095
device: cpu, dtype: torch.int64, 10000 times            0.025563121773302555
device: cpu, dtype: torch.float32, 10000 times          0.6707976600155234
device: cpu, dtype: torch.float64, 10000 times          0.5344798369333148
device: cuda, dtype: torch.int8, 10000 times            0.12768010422587395
device: cuda, dtype: torch.uint8, 10000 times           0.12681372743099928
device: cuda, dtype: torch.int16, 10000 times           0.12995595764368773
device: cuda, dtype: torch.int32, 10000 times           0.12989260721951723
device: cuda, dtype: torch.int64, 10000 times           0.12804713658988476
device: cuda, dtype: torch.float32, 10000 times         0.13013121113181114
device: cuda, dtype: torch.float64, 10000 times         0.1406280631199479
__irshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.3805475188419223
device: cpu, dtype: torch.uint8, 100000 times           0.36341007333248854
device: cpu, dtype: torch.int16, 100000 times           0.36908434610813856
device: cpu, dtype: torch.int32, 100000 times           0.3669992135837674
device: cpu, dtype: torch.int64, 100000 times           0.37847711704671383
device: cpu, dtype: torch.float32, 100000 times         0.4311870699748397
device: cpu, dtype: torch.float64, 100000 times         0.44503832422196865
device: cuda, dtype: torch.int8, 100000 times           1.4343859804794192
device: cuda, dtype: torch.uint8, 100000 times          1.4298221375793219
device: cuda, dtype: torch.int16, 100000 times          1.4460898758843541
device: cuda, dtype: torch.int32, 100000 times          1.4518025070428848
device: cuda, dtype: torch.int64, 100000 times          1.4456725595518947
device: cuda, dtype: torch.float32, 100000 times                1.4610810624435544
device: cuda, dtype: torch.float64, 100000 times                1.4736663019284606
__irshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.05944254994392395
device: cpu, dtype: torch.uint8, 10000 times            0.058085592463612556
device: cpu, dtype: torch.int16, 10000 times            0.05094402376562357
device: cpu, dtype: torch.int32, 10000 times            0.050842881202697754
device: cpu, dtype: torch.int64, 10000 times            0.06223891582340002
device: cpu, dtype: torch.float32, 10000 times          0.7006897022947669
device: cpu, dtype: torch.float64, 10000 times          0.5614962242543697
device: cuda, dtype: torch.int8, 10000 times            0.1461706068366766
device: cuda, dtype: torch.uint8, 10000 times           0.14335164614021778
device: cuda, dtype: torch.int16, 10000 times           0.1448021186515689
device: cuda, dtype: torch.int32, 10000 times           0.14513055887073278
device: cuda, dtype: torch.int64, 10000 times           0.1439579650759697
device: cuda, dtype: torch.float32, 10000 times         0.14666561130434275
device: cuda, dtype: torch.float64, 10000 times         0.1540807681158185
```
After:
```
_rshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.16366520430892706
device: cpu, dtype: torch.uint8, 100000 times           0.16091545950621367
device: cpu, dtype: torch.int16, 100000 times           0.1659633992239833
device: cpu, dtype: torch.int32, 100000 times           0.1682385364547372
device: cpu, dtype: torch.int64, 100000 times           0.17289020214229822
device: cpu, dtype: torch.float32, 100000 times         0.24359441827982664
device: cpu, dtype: torch.float64, 100000 times         0.21783945057541132
device: cuda, dtype: torch.int8, 100000 times           1.2517220517620444
device: cuda, dtype: torch.uint8, 100000 times          1.260181212797761
device: cuda, dtype: torch.int16, 100000 times          1.2681935774162412
device: cuda, dtype: torch.int32, 100000 times          1.2764465296640992
device: cuda, dtype: torch.int64, 100000 times          1.294325228780508
device: cuda, dtype: torch.float32, 100000 times                1.3062216322869062
device: cuda, dtype: torch.float64, 100000 times                1.303224254399538
__rshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.027045012451708317
device: cpu, dtype: torch.uint8, 10000 times            0.026978280395269394
device: cpu, dtype: torch.int16, 10000 times            0.025594274513423443
device: cpu, dtype: torch.int32, 10000 times            0.02593063935637474
device: cpu, dtype: torch.int64, 10000 times            0.02668109256774187
device: cpu, dtype: torch.float32, 10000 times          0.09746317192912102
device: cpu, dtype: torch.float64, 10000 times          0.1644029449671507
device: cuda, dtype: torch.int8, 10000 times            0.12530914042145014
device: cuda, dtype: torch.uint8, 10000 times           0.12615622486919165
device: cuda, dtype: torch.int16, 10000 times           0.12741118855774403
device: cuda, dtype: torch.int32, 10000 times           0.1284919548779726
device: cuda, dtype: torch.int64, 10000 times           0.12974756956100464
device: cuda, dtype: torch.float32, 10000 times         0.13044228963553905
device: cuda, dtype: torch.float64, 10000 times         0.13918257877230644
__irshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.19456563983112574
device: cpu, dtype: torch.uint8, 100000 times           0.190769555978477
device: cpu, dtype: torch.int16, 100000 times           0.2002257639542222
device: cpu, dtype: torch.int32, 100000 times           0.20456529594957829
device: cpu, dtype: torch.int64, 100000 times           0.2043834924697876
device: cpu, dtype: torch.float32, 100000 times         0.2832390898838639
device: cpu, dtype: torch.float64, 100000 times         0.2582795573398471
device: cuda, dtype: torch.int8, 100000 times           1.304957083426416
device: cuda, dtype: torch.uint8, 100000 times          1.3216373259201646
device: cuda, dtype: torch.int16, 100000 times          1.3238621400669217
device: cuda, dtype: torch.int32, 100000 times          1.333009460940957
device: cuda, dtype: torch.int64, 100000 times          1.3835567953065038
device: cuda, dtype: torch.float32, 100000 times                1.4483617274090648
device: cuda, dtype: torch.float64, 100000 times                1.4179155295714736
__irshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.03196091763675213
device: cpu, dtype: torch.uint8, 10000 times            0.03048650734126568
device: cpu, dtype: torch.int16, 10000 times            0.03048624936491251
device: cpu, dtype: torch.int32, 10000 times            0.030591044574975967
device: cpu, dtype: torch.int64, 10000 times            0.031246556900441647
device: cpu, dtype: torch.float32, 10000 times          0.10918692220002413
device: cpu, dtype: torch.float64, 10000 times          0.18057993799448013
device: cuda, dtype: torch.int8, 10000 times            0.13614848721772432
device: cuda, dtype: torch.uint8, 10000 times           0.130373639985919
device: cuda, dtype: torch.int16, 10000 times           0.1332557238638401
device: cuda, dtype: torch.int32, 10000 times           0.1331850504502654
device: cuda, dtype: torch.int64, 10000 times           0.1363008264452219
device: cuda, dtype: torch.float32, 10000 times         0.1370363561436534
device: cuda, dtype: torch.float64, 10000 times         0.1442740885540843
```
Fix https://github.com/pytorch/pytorch/issues/24512 #24516  https://github.com/pytorch/pytorch/issues/24659  https://github.com/pytorch/pytorch/issues/24663
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31594

Differential Revision: D19346542

Pulled By: ezyang

fbshipit-source-id: 37dd00b86898810b850cf4769c3af8aea6d4596b
2020-01-10 10:52:15 -08:00
a201027e93 Abstract atomic add calls (#31992)
Summary:
Instead of a mixture of direct calls to library-provided atomicAdd overloads, such as float atomicAdd(float*, float), and calls provided internally, such as void atomicAdd(long*, long), abstract them behind one API, void gpuAtomicAdd(T*, T), in THCAtomics.cuh for the PyTorch backend.

The advantage of this approach is that it allows us to more easily distinguish between the capabilities of different platforms (and their versions). Additionally, the abstraction of void-returning atomicAdds allows us to, in the future, support fast HW instructions on some platforms that will not return the previous value.

Call sites that do not satisfy the above conditions and are either highly platform-specific (the __half2 atomicAdd fast path in one operator) or explicitly require the return value (some int atomicAdd invocations) are left untouched. The Caffe2 backend also remains untouched.

While here, add a bunch of includes of THCAtomics.cuh that were missing before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31992

Differential Revision: D19330220

Pulled By: ezyang

fbshipit-source-id: d6ab73ec5168c77e328faeef6c6f48eefba00861
2020-01-10 09:48:42 -08:00
c6f41ae01b Fix and add more padding mode support for Conv (#31784)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29712 and #29668; adds arg checking, docs, and support for reflection and replication padding modes.
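
A small usage sketch of the padding modes after this change (shapes chosen for illustration):

```
import torch
import torch.nn as nn

for mode in ("zeros", "circular", "reflect", "replicate"):
    conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode=mode)
    out = conv(torch.randn(1, 3, 16, 16))
    print(mode, tuple(out.shape))  # each prints (1, 8, 16, 16)
```
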
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31784

Differential Revision: D19301974

Pulled By: ezyang

fbshipit-source-id: a0ed4815c0c22e416b16e256bba04324e376b2f8
2020-01-10 08:14:58 -08:00
b6f43afaca Fix tensordot allowing negative dims (#31954)
Summary:
fixes https://github.com/pytorch/pytorch/issues/31926
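
A small sketch of the now-allowed usage, assuming negative dims resolve like normal Python indexing:

```
import torch

a = torch.randn(3, 4, 5)
b = torch.randn(4, 5, 6)
# Contract the last two dims of `a` with the first two dims of `b`.
out = torch.tensordot(a, b, dims=([-2, -1], [0, 1]))
print(out.shape)  # torch.Size([3, 6])
```
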
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31954

Differential Revision: D19331847

Pulled By: zou3519

fbshipit-source-id: e30dd9517917c056a52be7d16f23247fe28f4e28
2020-01-10 07:42:04 -08:00
8ea49e7a08 add missing braces for format in rpc _to_worker_info (#31969)
Summary:
The braces were missing from the format string, so an invalid `name` passed into `_to_worker_info` was not printed in the error message.
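
A hypothetical illustration of the bug class (the exact string in the diff may differ):

```
name = "wroker1"  # a mistyped worker name
# Without braces, the argument is silently dropped from the message:
print("Unknown worker name: ".format(name))    # "Unknown worker name: "
# With braces, the offending name shows up, which is the fix:
print("Unknown worker name: {}".format(name))  # "Unknown worker name: wroker1"
```
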
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31969

Differential Revision: D19331927

Pulled By: rohan-varma

fbshipit-source-id: e74d47daec3224c2d9b9da3c0a6404cfa67baf65
2020-01-09 23:18:46 -08:00
4e84661139 update llvmlite to 0.30.0 (#31858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31858

Trying to upgrade docker image but ran into the following error:

```
Running test_nn ... [2020-01-04 18:05:12.537860]
Traceback (most recent call last):
  File "test_nn.py", line 45, in <module>
    from common_cuda import TEST_CUDA, TEST_MULTIGPU, TEST_CUDNN, TEST_CUDNN_VERSION
  File "/var/lib/jenkins/workspace/test/common_cuda.py", line 16, in <module>
    import numba.cuda
  File "/opt/conda/lib/python3.6/site-packages/numba/__init__.py", line 178, in <module>
    _ensure_llvm()
  File "/opt/conda/lib/python3.6/site-packages/numba/__init__.py", line 100, in _ensure_llvm
    raise ImportError(msg)
ImportError: Numba requires at least version 0.30.0 of llvmlite.
Installed version is 0.28.0.
```

Test Plan: Imported from OSS

Differential Revision: D19282923

Pulled By: ljk53

fbshipit-source-id: bdeefbf4f6c0c97df622282f76e77eb1eadba436
2020-01-09 19:28:08 -08:00
62f93443e5 Explain RPC behavior when using Tensor as arg or return value
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31968

Test Plan: Imported from OSS

Differential Revision: D19321380

Pulled By: mrshenli

fbshipit-source-id: e3431f1f02963cc8d8266a420ab03866106f26ac
2020-01-09 16:42:24 -08:00
6abfa9ad8a Quantized H Tangent function (#31031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31031

This activation will be needed for the LSTM implementation.
Also includes the QNNPack implementation.

Test Plan: Imported from OSS

Differential Revision: D19334280

Pulled By: z-a-f

fbshipit-source-id: ae14399765a47afdf9b1e072d3967c24ff473e8d
2020-01-09 16:16:17 -08:00
021e1e20c1 Revert D19320493: Javadoc changes
Test Plan: revert-hammer

Differential Revision:
D19320493

Original commit changeset: cc76b2a2acbe

fbshipit-source-id: 3b36dd2d2591acc60a06a421dd625c21adbe578a
2020-01-09 14:23:30 -08:00
700d1c5cbc update CI script to take string docker image version (#31857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31857

According to mingbowan, we will switch to string docker image
versions, because the tag is no longer an integer now that the docker
image build job has moved to CircleCI:
http://ossci-docker.s3-website.us-east-1.amazonaws.com/pytorch.html

Test Plan: - with stacked PR

Differential Revision: D19282726

Pulled By: ljk53

fbshipit-source-id: 7a12ae89a11cf15163b905734d50fed6dc98cb07
2020-01-09 14:15:10 -08:00
67ff051ddd Remove temporary fix for torchbind in BC check (#31982)
Summary:
Remove the patch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31982

Reviewed By: hl475

Differential Revision: D19333205

Pulled By: houseroad

fbshipit-source-id: 1d16fd31ede7266789141238520d47b762a7a340
2020-01-09 13:58:16 -08:00
2968faf154 Update doc about output_differentiability keyword in derivatives.yaml
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31925

Test Plan: Imported from OSS

Differential Revision: D19303833

Pulled By: albanD

fbshipit-source-id: 291a9f122720844a5f8386b22cf6abc66ae86e4d
2020-01-09 13:48:06 -08:00
67c1d930eb Lock graph_task before writing leaf_streams. (#31995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31995

Fixes #31906.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19331259

Pulled By: ezyang

fbshipit-source-id: 5d24bf3555e632211a9b6f8e50ff241603c18b3d
2020-01-09 13:26:36 -08:00
1296e2d55e C++ API parity: isinf (#31099)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31021; ports the legacy binding method of `isinf` to C++, thereby supporting JIT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31099

Differential Revision: D19314733

Pulled By: yf225

fbshipit-source-id: 5725c51d19c33b4fddd0fc9e7034078580bd534e
2020-01-09 13:16:13 -08:00
cfdfdf70d7 remove JSON dumping dependency (#30724)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19420

So after actually writing a C++ JSON dumping class, I figured that
a faster and cleaner way would be to simply rewrite the Python without
the JSON module, since the JSON we need to output is so simple.

For now I decided not to touch the `parse_cpu_trace` function, since
changing only `export_chrome_trace` already shows a 4x speedup.
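
A hedged sketch of the approach (field names follow the chrome-trace event format; the real function emits more fields):

```
def event_record(name, ts_us, dur_us, tid):
    # One complete event ("ph": "X") built with plain string formatting;
    # the schema is fixed and simple, so no json module is needed.
    return ('{"name": "%s", "ph": "X", "ts": %s, "dur": %s, '
            '"pid": 0, "tid": %s}' % (name, ts_us, dur_us, tid))

records = [event_record("mul", 0.0, 12.5, 0),
           event_record("add", 13.0, 4.2, 0)]
trace = "[" + ", ".join(records) + "]"  # valid JSON
```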

Here's the script I used for benchmarking:
``` python
import time
import torch

x = torch.ones(2, 2)

start = time.time()
with torch.autograd.profiler.profile() as prof:
  for _ in range(10000):
    x * x

for i in range(50):
  prof.export_chrome_trace("trace.json")

stop = time.time()

print(stop-start)
```
master branch (using json dump) -> 8.07515025138855
new branch (without json dump) ->  2.0943689346313477

I checked the trace file generated in the [test](https://github.com/pytorch/pytorch/blob/master/test/test_autograd.py#L2659)
and it does work fine.

Please let me know what you think.

If you still insist on the C++ version I can send a new patch soon enough.

CC ezyang rgommers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30724

Differential Revision: D19298955

Pulled By: ezyang

fbshipit-source-id: b0d7324ea5f90884ab8a00dd272f3aa3d9bc0427
2020-01-09 12:56:16 -08:00
bc68a8745f Spelling fix in transformer docs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31973

Differential Revision: D19330660

Pulled By: zou3519

fbshipit-source-id: 29ea1e790a34f0241cb7aba85110f087cdc069ba
2020-01-09 11:13:23 -08:00
26f552a3d1 Javadoc changes (#31956)
Summary:
- Add Javadoc url in index.rst
- Delete no longer needed java rst files
- Remove intersphinx extension from conf.py
- Remove javasphinx from docs/requirements.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31956

Differential Revision: D19320493

Pulled By: jlin27

fbshipit-source-id: cc76b2a2acbe2ecdabcd3339e1cc3182f0c906ae
2020-01-09 10:55:24 -08:00
e59e5ba5a3 Move geometric to Aten(CPU) (#31878)
Summary:
Fix https://github.com/pytorch/pytorch/issues/24704.
Benchmark script:
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    return time.time()

device = "cpu"

#warm up
for n in [10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(1000):
        input.geometric_(0.5)

for n in [1, 10, 100, 1000]:
    fwd_t = 0
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(10000):
        t1 = _time()
        input.geometric_(0.5)
        t2 = _time()
        fwd_t = fwd_t + (t2 -t1)
    fwd_avg = fwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.4f (ms)." % (n, fwd_avg))
```
Test device: **skx-8180**.
Before:
```
input size(128, 1) forward time is 0.0092 (ms).
input size(128, 10) forward time is 0.0802 (ms).
input size(128, 100) forward time is 0.7994 (ms).
input size(128, 1000) forward time is 7.8403 (ms).
```
After:
```
input size(128, 1) forward time is 0.0088 (ms).
input size(128, 10) forward time is 0.0781 (ms).
input size(128, 100) forward time is 0.7815 (ms).
input size(128, 1000) forward time is 7.7163 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31878

Differential Revision: D19314510

Pulled By: ezyang

fbshipit-source-id: 2d95bf9938c8becf280890acf9e37223ddd08a39
2020-01-09 10:47:56 -08:00
99b3f9cac4 Move log_sigmoid to Aten(CPU) (#30958)
Summary:
VitalyFedyunin, this PR ports the LogSigmoid activation to ATen.
Test script:
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    return time.time()

device = "cpu"
m = nn.LogSigmoid()
#warm up
for n in [1, 10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.randn(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [1, 10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.randn(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
**Before:**
```
input size(128, 1) forward time is 0.02 (ms); backwad avg time is 0.02 (ms).
input size(128, 10) forward time is 0.10 (ms); backwad avg time is 0.03 (ms).
input size(128, 100) forward time is 0.90 (ms); backwad avg time is 0.09 (ms).
input size(128, 1000) forward time is 9.04 (ms); backwad avg time is 0.87 (ms).
```
**After:**
```
input size(128, 1) forward time is 0.02 (ms); backwad avg time is 0.02 (ms).
input size(128, 10) forward time is 0.02 (ms); backwad avg time is 0.02 (ms).
input size(128, 100) forward time is 0.04 (ms); backwad avg time is 0.03 (ms).
input size(128, 1000) forward time is 0.28 (ms); backwad avg time is 0.07 (ms).
```
**OMP_NUM_THREADS=1:**
```
Before:
input size(128, 1) forward time is 0.02 (ms); backwad avg time is 0.02 (ms).
input size(128, 10) forward time is 0.10 (ms); backwad avg time is 0.03 (ms).
input size(128, 100) forward time is 0.88 (ms); backwad avg time is 0.10 (ms).
input size(128, 1000) forward time is 8.72 (ms); backwad avg time is 0.81 (ms).
After:
input size(128, 1) forward time is 0.01 (ms); backwad avg time is 0.02 (ms).
input size(128, 10) forward time is 0.02 (ms); backwad avg time is 0.02 (ms).
input size(128, 100) forward time is 0.07 (ms); backwad avg time is 0.03 (ms).
input size(128, 1000) forward time is 0.63 (ms); backwad avg time is 0.15 (ms).
```

Fix https://github.com/pytorch/pytorch/issues/24724, https://github.com/pytorch/pytorch/issues/24725.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30958

Differential Revision: D19275111

Pulled By: ezyang

fbshipit-source-id: bbfe82e58fb27a4fb21c1914c6547a9050072e5c
2020-01-09 10:30:00 -08:00
5a76335aaa Move lshift to Aten (#31566)
Summary:
VitalyFedyunin, this PR moves lshift to ATen.
Benchmark script:
```
import timeit
import torch
torch.manual_seed(1)

for n, t in [(10, 100000),(1000, 10000)]:
    print('__lshift__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a << b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}")', number=t))
        for dtype in ('torch.float32', 'torch.float64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a << b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randn({n}, dtype = {dtype}, device="{device}"); b = torch.randn({n}, dtype = {dtype}, device="{device}")', number=t))

for n, t in [(10, 100000),(1000, 10000)]:
    print('__ilshift__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a << b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.tensor(5, dtype = {dtype}, device="{device}")', number=t))
        for dtype in ('torch.float32', 'torch.float64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a << b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randn({n}, dtype = {dtype}, device="{device}"); b = torch.tensor(5, dtype = {dtype}, device="{device}")', number=t))
```
Device: **Tesla P100, skx-8180**
CUDA version: **9.0.176**

Before:
```
__lshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.31618343852460384
device: cpu, dtype: torch.uint8, 100000 times           0.31258584931492805
device: cpu, dtype: torch.int16, 100000 times           0.3140896391123533
device: cpu, dtype: torch.int32, 100000 times           0.34389012958854437
device: cpu, dtype: torch.int64, 100000 times           0.339566046372056
device: cpu, dtype: torch.float32, 100000 times         0.4180623721331358
device: cpu, dtype: torch.float64, 100000 times         0.4165227338671684
device: cuda, dtype: torch.int8, 100000 times           1.7851383443921804
device: cuda, dtype: torch.uint8, 100000 times          1.7842160519212484
device: cuda, dtype: torch.int16, 100000 times          1.789359962567687
device: cuda, dtype: torch.int32, 100000 times          1.7822618428617716
device: cuda, dtype: torch.int64, 100000 times          1.7968465769663453
device: cuda, dtype: torch.float32, 100000 times                1.8066061967983842
device: cuda, dtype: torch.float64, 100000 times                1.8046843251213431
__lshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.04618230368942022
device: cpu, dtype: torch.uint8, 10000 times            0.04634759668260813
device: cpu, dtype: torch.int16, 10000 times            0.040676115080714226
device: cpu, dtype: torch.int32, 10000 times            0.04404774494469166
device: cpu, dtype: torch.int64, 10000 times            0.04511771444231272
device: cpu, dtype: torch.float32, 10000 times          0.6887832451611757
device: cpu, dtype: torch.float64, 10000 times          0.5559549620375037
device: cuda, dtype: torch.int8, 10000 times            0.17996764183044434
device: cuda, dtype: torch.uint8, 10000 times           0.17970609478652477
device: cuda, dtype: torch.int16, 10000 times           0.17873135022819042
device: cuda, dtype: torch.int32, 10000 times           0.1781835313886404
device: cuda, dtype: torch.int64, 10000 times           0.17846618220210075
device: cuda, dtype: torch.float32, 10000 times         0.18056879844516516
device: cuda, dtype: torch.float64, 10000 times         0.18132662680000067
__ilshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.61110960226506
device: cpu, dtype: torch.uint8, 100000 times           0.6333359787240624
device: cpu, dtype: torch.int16, 100000 times           0.6345370784401894
device: cpu, dtype: torch.int32, 100000 times           0.6470990972593427
device: cpu, dtype: torch.int64, 100000 times           0.6587044578045607
device: cpu, dtype: torch.float32, 100000 times         0.7269002720713615
device: cpu, dtype: torch.float64, 100000 times         0.7217964073643088
device: cuda, dtype: torch.int8, 100000 times           1.9880435159429908
device: cuda, dtype: torch.uint8, 100000 times          1.986489498987794
device: cuda, dtype: torch.int16, 100000 times          2.0059875370934606
device: cuda, dtype: torch.int32, 100000 times          1.995262237265706
device: cuda, dtype: torch.int64, 100000 times          1.9974954994395375
device: cuda, dtype: torch.float32, 100000 times                2.00442770216614
device: cuda, dtype: torch.float64, 100000 times                2.009664717130363
__ilshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.08199594635516405
device: cpu, dtype: torch.uint8, 10000 times            0.08096733782440424
device: cpu, dtype: torch.int16, 10000 times            0.0734213450923562
device: cpu, dtype: torch.int32, 10000 times            0.0769620593637228
device: cpu, dtype: torch.int64, 10000 times            0.08650507684797049
device: cpu, dtype: torch.float32, 10000 times          0.7196345143020153
device: cpu, dtype: torch.float64, 10000 times          0.597336508333683
device: cuda, dtype: torch.int8, 10000 times            0.19723015930503607
device: cuda, dtype: torch.uint8, 10000 times           0.19754122477024794
device: cuda, dtype: torch.int16, 10000 times           0.19710093270987272
device: cuda, dtype: torch.int32, 10000 times           0.19611249305307865
device: cuda, dtype: torch.int64, 10000 times           0.19750046730041504
device: cuda, dtype: torch.float32, 10000 times         0.19680574722588062
device: cuda, dtype: torch.float64, 10000 times         0.19689027685672045
```
After:
```
__lshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.3031281465664506
device: cpu, dtype: torch.uint8, 100000 times           0.30772678554058075
device: cpu, dtype: torch.int16, 100000 times           0.3088294789195061
device: cpu, dtype: torch.int32, 100000 times           0.30907699652016163
device: cpu, dtype: torch.int64, 100000 times           0.31315001379698515
device: cpu, dtype: torch.float32, 100000 times         0.38823566399514675
device: cpu, dtype: torch.float64, 100000 times         0.39300001971423626
device: cuda, dtype: torch.int8, 100000 times           1.3225595457479358
device: cuda, dtype: torch.uint8, 100000 times          1.31739442050457
device: cuda, dtype: torch.int16, 100000 times          1.3198596313595772
device: cuda, dtype: torch.int32, 100000 times          1.309600466862321
device: cuda, dtype: torch.int64, 100000 times          1.3264533821493387
device: cuda, dtype: torch.float32, 100000 times                1.3377520674839616
device: cuda, dtype: torch.float64, 100000 times                1.3343619462102652
__lshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.02718757465481758
device: cpu, dtype: torch.uint8, 10000 times            0.02701799664646387
device: cpu, dtype: torch.int16, 10000 times            0.025483975186944008
device: cpu, dtype: torch.int32, 10000 times            0.025557605549693108
device: cpu, dtype: torch.int64, 10000 times            0.026179466396570206
device: cpu, dtype: torch.float32, 10000 times          0.0962932649999857
device: cpu, dtype: torch.float64, 10000 times          0.1611471576616168
device: cuda, dtype: torch.int8, 10000 times            0.13165222201496363
device: cuda, dtype: torch.uint8, 10000 times           0.13358880020678043
device: cuda, dtype: torch.int16, 10000 times           0.1342075066640973
device: cuda, dtype: torch.int32, 10000 times           0.1328689968213439
device: cuda, dtype: torch.int64, 10000 times           0.13336248509585857
device: cuda, dtype: torch.float32, 10000 times         0.1345295710489154
device: cuda, dtype: torch.float64, 10000 times         0.14084953162819147
__ilshift__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.19080814253538847
device: cpu, dtype: torch.uint8, 100000 times           0.18541878275573254
device: cpu, dtype: torch.int16, 100000 times           0.19136024825274944
device: cpu, dtype: torch.int32, 100000 times           0.1916898973286152
device: cpu, dtype: torch.int64, 100000 times           0.1973192635923624
device: cpu, dtype: torch.float32, 100000 times         0.2668355852365494
device: cpu, dtype: torch.float64, 100000 times         0.24472137168049812
device: cuda, dtype: torch.int8, 100000 times           1.3581306440755725
device: cuda, dtype: torch.uint8, 100000 times          1.3522163443267345
device: cuda, dtype: torch.int16, 100000 times          1.366145665757358
device: cuda, dtype: torch.int32, 100000 times          1.3674909211695194
device: cuda, dtype: torch.int64, 100000 times          1.3734915973618627
device: cuda, dtype: torch.float32, 100000 times                1.3831533305346966
device: cuda, dtype: torch.float64, 100000 times                1.396162535995245
__ilshift__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.02847585454583168
device: cpu, dtype: torch.uint8, 10000 times            0.02960751298815012
device: cpu, dtype: torch.int16, 10000 times            0.028516249731183052
device: cpu, dtype: torch.int32, 10000 times            0.02842544950544834
device: cpu, dtype: torch.int64, 10000 times            0.029186096973717213
device: cpu, dtype: torch.float32, 10000 times          0.0999628696590662
device: cpu, dtype: torch.float64, 10000 times          0.16676222812384367
device: cuda, dtype: torch.int8, 10000 times            0.13856443110853434
device: cuda, dtype: torch.uint8, 10000 times           0.13766566663980484
device: cuda, dtype: torch.int16, 10000 times           0.13652489613741636
device: cuda, dtype: torch.int32, 10000 times           0.13678150344640017
device: cuda, dtype: torch.int64, 10000 times           0.13749946560710669
device: cuda, dtype: torch.float32, 10000 times         0.13879029918462038
device: cuda, dtype: torch.float64, 10000 times         0.14587809145450592
```

Fixes https://github.com/pytorch/pytorch/issues/24510, https://github.com/pytorch/pytorch/issues/24514, https://github.com/pytorch/pytorch/issues/24657, https://github.com/pytorch/pytorch/issues/24661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31566

Differential Revision: D19314251

Pulled By: ezyang

fbshipit-source-id: 52df17b2c18ef1880374c6dbcf18fb1118086552
2020-01-09 09:41:36 -08:00
5c423cae72 Add precision tests for CUDA half linspace+logspace (#31962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31962

I added precision tests for CUDA half, float, and double.

The precision for CUDA half seems bad, but I checked the numbers against
previous versions of PyTorch: the output of CUDA half linspace+logspace
is exactly the same as in 1.2.0.

Test Plan: - Run CI

Differential Revision: D19320182

Pulled By: zou3519

fbshipit-source-id: 38d3d4dea2807875ed0b0ec2b93b19c10a289988
2020-01-09 07:35:52 -08:00
5d5f156558 Revert D18903453: Quantized H Tangent function
Test Plan: revert-hammer

Differential Revision:
D18903453

Original commit changeset: 0050b1cebb1d

fbshipit-source-id: 205978f71d5688d4068861f7cf2dff40fbb311c6
2020-01-09 07:30:49 -08:00
ddff4efa26 Don't use RTLD_GLOBAL to load _C. (#31162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31162

This should help us resolve a multitude of weird segfaults and crashes
when PyTorch is imported along with other packages. Those would often
happen because libtorch symbols were exposed globally and could be used
as a source of relocations in shared libraries loaded after libtorch.

Fixes #3059.

Some of the subtleties in preparing this patch:

* Getting ASAN to play ball was a pain in the ass. The basic problem is that when we load with `RTLD_LOCAL`, we now may load a library multiple times into the address space; this happens when we have custom C++ extensions. Since the libraries are usually identical, this is usually benign, but it is technically undefined behavior and UBSAN hates it. I sprayed a few ways of getting things to "work" correctly: I preload libstdc++ (so that it is seen consistently over all library loads) and turned off vptr checks entirely. Another possibility is that we could have a mode where we use RTLD_GLOBAL to load _C, which would be acceptable in environments where you're sure C++ lines up correctly. There's a long comment in the test script going into more detail about this.
* Making some of our shared library dependencies load with `RTLD_LOCAL` breaks them. OpenMPI and MKL don't work; they play linker shenanigans to look up their symbols which doesn't work when loaded locally, and if we load a library with `RTLD_LOCAL` we aren't able to subsequently see it with `ctypes`. To solve this problem, we employ a clever device invented by apaszke: we create a dummy library `torch_global_deps` with dependencies on all of the libraries which need to be loaded globally, and then load that with `RTLD_GLOBAL`. As long as none of these libraries have C++ symbols, we can avoid confusion about the C++ standard library (see the sketch below).
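
A minimal sketch of the resulting load order on Linux; the library name and the import shim are illustrative, not the actual `torch/__init__.py` logic:
```
import ctypes
import os
import sys

# Load the dummy aggregator globally so dependencies that play linker
# tricks (MKL, OpenMPI) can still resolve their symbols.
ctypes.CDLL("libtorch_global_deps.so", mode=ctypes.RTLD_GLOBAL)  # illustrative name

# Import _C with RTLD_LOCAL so libtorch symbols are not exposed as
# relocation targets to shared libraries loaded later.
old_flags = sys.getdlopenflags()
sys.setdlopenflags(os.RTLD_LOCAL | os.RTLD_NOW)
try:
    import torch._C  # the real import machinery differs; this only shows the flags
finally:
    sys.setdlopenflags(old_flags)
```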

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D19262579

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 06a48a5d2c9036aacd535f7e8a4de0e8fe1639f2
2020-01-09 07:28:15 -08:00
8614860210 Uniformly apply Windows logic in cpp_extensions everywhere (#31161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31161

Previously, it wasn't necessary to specify `DT_NEEDED` in C++ extensions on Linux (aka pass `-l` flags) because all of the symbols would have already been loaded with `RTLD_GLOBAL`, so there wouldn't be any undefined symbols.  But when we switch to loading `_C` with `RTLD_LOCAL`, it's now necessary for all the C++ extensions to know what libraries to link with. The resulting code is clearer and more uniform, so it's a win all around.
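
A sketch of what this means for an extension's `setup.py` (names are placeholders, and the `CppExtension` helper may already supply the core `-l` flags for you):
```
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",
    ext_modules=[
        CppExtension(
            name="my_ext",
            sources=["my_ext.cpp"],
            # With _C loaded RTLD_LOCAL, the extension must link its
            # dependencies explicitly instead of finding them pre-loaded.
            libraries=["c10", "torch", "torch_python"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```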

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19262578

Pulled By: ezyang

fbshipit-source-id: a893cc96f2e9aad1c064a6de4f7ccf79257dec3f
2020-01-09 07:28:11 -08:00
0dbd5c0bfe Added torchvision tests as part of ORT tests (#31835)
Summary:
Added torchvision tests as part of ORT tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31835

Reviewed By: hl475

Differential Revision: D19278607

Pulled By: houseroad

fbshipit-source-id: 18a6a85ce3019bcc9aee9517af1378964b585afd
2020-01-08 21:04:29 -08:00
6d9a9e379d Fix segfault in caffe2 slice test (#31801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31801

Try to fix issue #30764

Test Plan:
python test/onnx/test_utility_funs.py TestUtilityFuns

Imported from OSS

Differential Revision: D19315046

fbshipit-source-id: de3595969280e4ebe762cb098ff0891f8b5a9a90
2020-01-08 17:13:29 -08:00
9e9ca6ec37 add conversion functions to embedding tables (#31083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31083

add conversions between fp32/fp16 and int8 rowwise-quantized formats with fp32/fp16 scale/bias
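
A toy numpy sketch of an 8-bit rowwise scheme with per-row scale/bias (illustrative only; not the actual caffe2/fbgemm kernels, and the function names are made up):
```
import numpy as np

def quantize_rowwise(w):                         # w: (rows, cols), fp32
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 255.0, 1e-8)  # per-row scale
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale.astype(np.float32), lo.astype(np.float32)

def dequantize_rowwise(q, scale, bias):
    return q.astype(np.float32) * scale + bias

w = np.random.randn(4, 16).astype(np.float32)
q, s, b = quantize_rowwise(w)
assert np.allclose(w, dequantize_rowwise(q, s, b), atol=s.max())
```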

Test Plan:
added unit tests
enhanced shape inference tests

Reviewed By: jspark1105

Differential Revision: D18920547

fbshipit-source-id: 6b3d7cb93f9d1669ecf511817d73976177632891
2020-01-08 16:56:12 -08:00
eb23171bce TensorIterator norm update (#31903)
Summary:
Special-case norm out where p == 2: instead of calling `pow`,
we use multiplication as a faster code path.
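
For intuition, a sketch of the equivalence being exploited (plain PyTorch, not the TensorIterator kernel):
```
import torch

x = torch.randn(1_000_000)

p = 2.0
norm_generic = x.abs().pow(p).sum().pow(1.0 / p)   # generic |x|^p path
norm_fast = (x * x).sum().sqrt()                   # special-cased p == 2 path

assert torch.allclose(norm_generic, norm_fast)
```
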
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31903

Differential Revision: D19312749

Pulled By: ngimel

fbshipit-source-id: 73732b7b37a243a14438609784795b920271a0b5
2020-01-08 16:50:42 -08:00
8ecd3f783d check for object equality in constant pooling (#31800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31800

If we know that two constants are the same object, we can ignore other constraints and pool them together. This fixes an issue introduced by the other PR where quantization relied on constant pooling happening for correctness.
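
A toy illustration of the identity shortcut: values that are literally the same object can always share one slot, regardless of other constraints (plain Python, not the TorchScript pass):
```
def pool_constants(constants):
    pooled, seen = [], {}
    for c in constants:
        key = id(c)                  # object identity, not structural equality
        if key not in seen:
            seen[key] = len(pooled)
            pooled.append(c)
    return pooled, [seen[id(c)] for c in constants]

a = [1, 2]
b = a                                # the same object as a
c = [1, 2]                           # equal, but a distinct object
consts, remap = pool_constants([a, b, c])
print(len(consts), remap)            # 2 [0, 0, 1]: a and b pooled, c kept
```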

Test Plan: Imported from OSS

Differential Revision: D19269499

Pulled By: eellison

fbshipit-source-id: 9d4396125aa6899cb081863d463d4f024135cbf4
2020-01-08 16:47:07 -08:00
319cc21108 Add AliasDb API For Changing Aliasing (#31501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31501

We have a number of places in our code base where we should be checking if it's safe to change the alias relationship between two sets of values. This PR adds an api to Alias Db to consolidate the logic, and refactors Constant Pooling and `CSE` to use the new api. Next steps: add api usage in peephole.cpp where applicable.

Happy to bikeshed `AliasDb::safeToChangeAliasingRelationship`. Previously I suggested `AliasDb::safeToIntroduceAliasing`, however that's not quite accurate, because this API also handles when it is unsafe to remove aliasing.

Alternate suggestions: `safeToChangeAliasing`, `validToChangeAliasing`, `validToChangeAliasingRelationship`

Related:  https://github.com/pytorch/pytorch/issues/28360

Test Plan: Imported from OSS

Differential Revision: D19254413

Pulled By: eellison

fbshipit-source-id: 17f7f52ad2d1526d303132767cbbb32f8189ae15
2020-01-08 16:47:03 -08:00
5cc49ed45f Document IValue (#31904)
Summary:
This is a first-pass attempt at documenting `IValue` to help with problems like the one in #17165. Most users are probably concerned with
 * how to make an `IValue` that matches the input type to their graph (most of the constructors are pretty self-explanatory, so as long as they are in the docs I think it's enough)
 * how to extract the results after running their graph (there is a small note on the behavior of `.toX()` based on confusions we've had in the past)

Preview:
https://driazati.github.io/pytorch_doc_previews/31904/api/structc10_1_1_i_value.html#exhale-struct-structc10-1-1-i-value

There are also some random CSS fixes to clean up the style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31904

Pulled By: driazati

Differential Revision: D19318733

fbshipit-source-id: b29dae3349d5a7ea5a3b8e09cd23f7ff8434edb4
2020-01-08 16:08:35 -08:00
883fb5434a Use real argument names for Python functions (#29300)
Summary:
This hooks up `inspect` so that Python functions get their parameter
names attached instead of naming them `0, 1, 2, ...`. This also fixes
issue #28537 where `ignore` functions were improperly typing `self`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29300

Pulled By: driazati

Differential Revision: D19256434

fbshipit-source-id: 6a1fe7bd0afab708b8439517798955d0abfeb44c
2020-01-08 15:41:28 -08:00
09a22f3301 Remove C++ docs contributing page (#31908)
Summary:
Stacked PRs
 * **#31908 - Remove C++ docs contributing page**
 * #31905 - Add doc previewing instructions

We should have 1 source of truth for contribution instructions (CONTRIBUTING.md).
This PR moves the instructions from the C++ doc pages there instead of keeping a
separate page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31908

Pulled By: driazati

Differential Revision: D19296366

fbshipit-source-id: c1daf004259342bd09e09dea3b80e34db47066ec
2020-01-08 15:37:35 -08:00
8c59d48281 Add doc previewing instructions (#31905)
Summary:
Stacked PRs
 * #31908 - Remove C++ docs contributing page
 * **#31905 - Add doc previewing instructions**

This adds some instructions on how to get started with GitHub Pages so you can show reviewers your documentation changes. Hopefully we can delete this eventually and build docs automatically on relevant PRs in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31905

Pulled By: driazati

Differential Revision: D19296364

fbshipit-source-id: df47fa1a8d7be029c3efcf6521298583ad9f7a95
2020-01-08 15:37:31 -08:00
dedd16b418 remove THConv code which is never used (#31879)
Summary:
Just remove dead code in TH.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31879

Differential Revision: D19315818

Pulled By: ezyang

fbshipit-source-id: dbeb2475e19e9ebf769df2649cc859c08d3d184d
2020-01-08 15:14:27 -08:00
9a3cb1e859 Move cauchy to Aten(CPU) (#31824)
Summary:
Fix https://github.com/pytorch/pytorch/issues/24684.
Benchmark script :
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    return time.time()

device = "cpu"

#warm up
for n in [10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(1000):
        input.cauchy_()

for n in [1, 10, 100, 1000]:
    fwd_t = 0
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(10000):
        t1 = _time()
        input.cauchy_()
        t2 = _time()
        fwd_t = fwd_t + (t2 -t1)
    fwd_avg = fwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.4f (ms)." % (n, fwd_avg))
```
Test device: **skx-8180**.
Before:
```
input size(128, 1) forward time is 0.0071 (ms).
input size(128, 10) forward time is 0.0596 (ms).
input size(128, 100) forward time is 0.5798 (ms).
input size(128, 1000) forward time is 5.8395 (ms).
```
After:
```
input size(128, 1) forward time is 0.0070 (ms).
input size(128, 10) forward time is 0.0583 (ms).
input size(128, 100) forward time is 0.5714 (ms).
input size(128, 1000) forward time is 5.7674 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31824

Differential Revision: D19314411

Pulled By: ezyang

fbshipit-source-id: 58098546face3e5971b023f702cfe44ff1cccfbc
2020-01-08 15:10:53 -08:00
9ba6a768de Add op bitwise_or (#31559)
Summary:
ezyang, this PR adds the bitwise_or operator, following https://github.com/pytorch/pytorch/pull/31104.
Benchmark script :
```
import timeit
import torch
torch.manual_seed(1)

for n, t in [(10, 100000),(1000, 10000)]:
    print('__or__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a | b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}")', number=t))

for n, t in [(10, 100000),(1000, 10000)]:
    print('__ior__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a |= b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.tensor(5, dtype = {dtype}, device="{device}")', number=t))
```
Device: **Tesla P100, skx-8180**
Cuda verison: **9.0.176**

Before:
```
__or__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.17616272252053022
device: cpu, dtype: torch.uint8, 100000 times           0.17148233391344547
device: cpu, dtype: torch.int16, 100000 times           0.17616403382271528
device: cpu, dtype: torch.int32, 100000 times           0.17717823758721352
device: cpu, dtype: torch.int64, 100000 times           0.1801931718364358
device: cuda, dtype: torch.int8, 100000 times           1.270583058707416
device: cuda, dtype: torch.uint8, 100000 times          1.2636413089931011
device: cuda, dtype: torch.int16, 100000 times          1.2839747751131654
device: cuda, dtype: torch.int32, 100000 times          1.2548385225236416
device: cuda, dtype: torch.int64, 100000 times          1.2650810535997152
__or__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.031136621721088886
device: cpu, dtype: torch.uint8, 10000 times            0.030786747112870216
device: cpu, dtype: torch.int16, 10000 times            0.02391665056347847
device: cpu, dtype: torch.int32, 10000 times            0.024147341027855873
device: cpu, dtype: torch.int64, 10000 times            0.024414129555225372
device: cuda, dtype: torch.int8, 10000 times            0.12741921469569206
device: cuda, dtype: torch.uint8, 10000 times           0.1249831635504961
device: cuda, dtype: torch.int16, 10000 times           0.1283819805830717
device: cuda, dtype: torch.int32, 10000 times           0.12591975275427103
device: cuda, dtype: torch.int64, 10000 times           0.12655890546739101
__ior__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.3908365070819855
device: cpu, dtype: torch.uint8, 100000 times           0.38267823681235313
device: cpu, dtype: torch.int16, 100000 times           0.38239253498613834
device: cpu, dtype: torch.int32, 100000 times           0.3817988149821758
device: cpu, dtype: torch.int64, 100000 times           0.3901665909215808
device: cuda, dtype: torch.int8, 100000 times           1.4211318120360374
device: cuda, dtype: torch.uint8, 100000 times          1.4215159295126796
device: cuda, dtype: torch.int16, 100000 times          1.4307750314474106
device: cuda, dtype: torch.int32, 100000 times          1.4123614141717553
device: cuda, dtype: torch.int64, 100000 times          1.4480243818834424
__ior__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.06468924414366484
device: cpu, dtype: torch.uint8, 10000 times            0.06442475505173206
device: cpu, dtype: torch.int16, 10000 times            0.05267547257244587
device: cpu, dtype: torch.int32, 10000 times            0.05286940559744835
device: cpu, dtype: torch.int64, 10000 times            0.06211103219538927
device: cuda, dtype: torch.int8, 10000 times            0.15332304500043392
device: cuda, dtype: torch.uint8, 10000 times           0.15353196952492
device: cuda, dtype: torch.int16, 10000 times           0.15300503931939602
device: cuda, dtype: torch.int32, 10000 times           0.15274472255259752
device: cuda, dtype: torch.int64, 10000 times           0.1512152962386608
```
After:
```
__or__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.2465507509186864
device: cpu, dtype: torch.uint8, 100000 times           0.2472386620938778
device: cpu, dtype: torch.int16, 100000 times           0.2469814233481884
device: cpu, dtype: torch.int32, 100000 times           0.2535214088857174
device: cpu, dtype: torch.int64, 100000 times           0.24855613708496094
device: cuda, dtype: torch.int8, 100000 times           1.4351346511393785
device: cuda, dtype: torch.uint8, 100000 times          1.4434308474883437
device: cuda, dtype: torch.int16, 100000 times          1.4520929995924234
device: cuda, dtype: torch.int32, 100000 times          1.4456610176712275
device: cuda, dtype: torch.int64, 100000 times          1.4580101007595658
__or__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.029985425993800163
device: cpu, dtype: torch.uint8, 10000 times            0.03024935908615589
device: cpu, dtype: torch.int16, 10000 times            0.026356655173003674
device: cpu, dtype: torch.int32, 10000 times            0.027377349324524403
device: cpu, dtype: torch.int64, 10000 times            0.029163731262087822
device: cuda, dtype: torch.int8, 10000 times            0.14540370367467403
device: cuda, dtype: torch.uint8, 10000 times           0.1456305105239153
device: cuda, dtype: torch.int16, 10000 times           0.1450125053524971
device: cuda, dtype: torch.int32, 10000 times           0.1472016740590334
device: cuda, dtype: torch.int64, 10000 times           0.14709716010838747
__ior__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.27195510920137167
device: cpu, dtype: torch.uint8, 100000 times           0.2692424338310957
device: cpu, dtype: torch.int16, 100000 times           0.27726674638688564
device: cpu, dtype: torch.int32, 100000 times           0.2815811652690172
device: cpu, dtype: torch.int64, 100000 times           0.2852728571742773
device: cuda, dtype: torch.int8, 100000 times           1.4743850827217102
device: cuda, dtype: torch.uint8, 100000 times          1.4766502184793353
device: cuda, dtype: torch.int16, 100000 times          1.4774163831025362
device: cuda, dtype: torch.int32, 100000 times          1.4749693805351853
device: cuda, dtype: torch.int64, 100000 times          1.5772947426885366
__ior__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.03614502027630806
device: cpu, dtype: torch.uint8, 10000 times            0.03619729354977608
device: cpu, dtype: torch.int16, 10000 times            0.0319912089034915
device: cpu, dtype: torch.int32, 10000 times            0.03319283854216337
device: cpu, dtype: torch.int64, 10000 times            0.0343862259760499
device: cuda, dtype: torch.int8, 10000 times            0.1581476852297783
device: cuda, dtype: torch.uint8, 10000 times           0.15974601730704308
device: cuda, dtype: torch.int16, 10000 times           0.15957212820649147
device: cuda, dtype: torch.int32, 10000 times           0.16002820804715157
device: cuda, dtype: torch.int64, 10000 times           0.16129320487380028
```

Fixes https://github.com/pytorch/pytorch/issues/24511, https://github.com/pytorch/pytorch/issues/24515, https://github.com/pytorch/pytorch/issues/24658, https://github.com/pytorch/pytorch/issues/24662.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31559

Differential Revision: D19315875

Pulled By: ezyang

fbshipit-source-id: 4a3ca88fdafbeb796079687e676228111eb44aad
2020-01-08 15:06:30 -08:00
4f9d2f74e2 Port softplus activation to Aten(CPU+CUDA) (#30504)
Summary:
VitalyFedyunin, this PR ports the Softplus activation to ATen (a reference sketch of the computation follows the test script below):
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.Softplus()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
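
For reference, a sketch of what `nn.Softplus` computes (not the ATen kernel itself):
```
import torch
import torch.nn.functional as F

def softplus_ref(x, beta=1.0, threshold=20.0):
    # softplus(x) = (1/beta) * log(1 + exp(beta * x)), with a linear
    # passthrough above `threshold` for numerical stability
    scaled = beta * x
    return torch.where(scaled > threshold, x, torch.log1p(torch.exp(scaled)) / beta)

x = torch.randn(8)
assert torch.allclose(softplus_ref(x), F.softplus(x))
```
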
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.12 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.18 (ms).
CPU:
input size(128, 100) forward time is 1.16 (ms); backward avg time is 0.69 (ms).
input size(128, 10000) forward time is 60.19 (ms); backward avg time is 31.86 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).
CPU:
input size(128, 100) forward time is 0.43 (ms); backward avg time is 0.16 (ms).
input size(128, 10000) forward time is 1.65 (ms); backward avg time is 0.83 (ms).
```
`OMP_NUM_THREADS=1:`
```
Before:
input size(128, 100) forward time is 0.53 (ms); backward avg time is 0.28 (ms).
input size(128, 10000) forward time is 51.33 (ms); backward avg time is 25.48 (ms).
After:
input size(128, 100) forward time is 0.44 (ms); backward avg time is 0.16 (ms).
input size(128, 10000) forward time is 42.05 (ms); backward avg time is 13.97 (ms).
```

Fix https://github.com/pytorch/pytorch/issues/24633, https://github.com/pytorch/pytorch/issues/24634, https://github.com/pytorch/pytorch/issues/24766, https://github.com/pytorch/pytorch/issues/24767.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30504

Differential Revision: D19274913

Pulled By: ezyang

fbshipit-source-id: 21b29e8459dcba5a040cc68333887b45a858328e
2020-01-08 15:03:53 -08:00
d2fdf140af Combine all the user inputs together and convert them to fp16 (#31898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31898

Att

Reviewed By: tracelogfb

Differential Revision: D19291357

fbshipit-source-id: 747ed5234ca042ceeaff2d094701ead7597ac3ee
2020-01-08 14:36:42 -08:00
8b4feff01d Use simd version for fp16 conversions (#31897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31897

The previous version only used AVX2. The _simd version uses AVX512 if the CPU is capable of it.

Test Plan: Unittest

Reviewed By: tracelogfb

Differential Revision: D19291499

fbshipit-source-id: 3b1ee0ba756e5c9defbd5caf7f68982d9b2ca06c
2020-01-08 14:36:38 -08:00
1314f7f4f4 Ensure the original grad_mode is restored during backward (#31884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31884

Fix #31715

Test Plan: Imported from OSS

Differential Revision: D19301076

Pulled By: albanD

fbshipit-source-id: 2d20c01bfb6364fa96c8fe5aa5ce7ea39defa3ce
2020-01-08 14:16:51 -08:00
c299cb05ef temporary fix for jit test backward compatibility issues
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31949

Test Plan: Imported from OSS

Differential Revision: D19314763

Pulled By: albanD

fbshipit-source-id: b5eff0ed53a371d260596ca85d914c8bddb0a8aa
2020-01-08 13:32:08 -08:00
462bfc7fe7 docker hub image info (#31923)
Summary:
result: http://docker.pytorch.org/docker_hub.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31923

Differential Revision: D19316770

Pulled By: mingbowan

fbshipit-source-id: 57f34d8983d26772bb0d310fa0a4085674c860e5
2020-01-08 13:20:06 -08:00
5dfcfeebb8 Revert D19298735: Emit warning from deprecated torch function signatures
Test Plan: revert-hammer

Differential Revision:
D19298735

Original commit changeset: 03cb78af1765

fbshipit-source-id: 304a6d4412f53a8fc822d36897c96815432e0f70
2020-01-08 13:04:41 -08:00
620060cb0c Quantized H Tangent function (#31031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31031

This activation will be needed for the LSTM implementation.
Also includes the QNNPack implementation.

Test Plan: Imported from OSS

Differential Revision: D18903453

Pulled By: z-a-f

fbshipit-source-id: 0050b1cebb1ddb179b7ecbcb114fe70705070f67
2020-01-08 12:59:39 -08:00
54777b1e73 Avoid reference invalidation in cuda SpectralOps' plan_caches (#31861)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31412

The root cause is `plan_caches` being resized in one thread while another holds a reference to an existing `CuFFTParamsLRUCache` which then becomes invalidated.

I was able to reproduce the crash very reliably without this fix applied and no longer see it. Being a race condition, it's hard to say for sure though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31861

Differential Revision: D19312314

Pulled By: ezyang

fbshipit-source-id: 06e4561128d503f2d70cdfe1982be0f3db2a8cf8
2020-01-08 11:50:05 -08:00
7f723cbd8a Revert D19290954: Implement backend-agnostic rpc._wait_all_workers() utility
Test Plan: revert-hammer

Differential Revision:
D19290954

Original commit changeset: cdb22203c2f2

fbshipit-source-id: 2ae194a06a645e4f48879271eccf0588b0956cd3
2020-01-08 10:25:51 -08:00
c66ca74f03 Add device debug info to CUDA build (#31929)
Summary:
Also print NVCC flags in the summary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31929

Differential Revision: D19312079

Pulled By: ezyang

fbshipit-source-id: cd20d5a385f61174c1907a9ad883c04de66ef037
2020-01-08 09:56:20 -08:00
f0072b3af5 Remove C++11 compatibility from c10::optional (#30919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30919

deletecode
ghstack-source-id: 96383227

Test Plan: waitforsandcastle

Differential Revision: D18869641

fbshipit-source-id: c08345d17a291cea3749af20473b6acddc78ab27
2020-01-08 09:19:59 -08:00
f67851d69a Fix c10::util::get_fully_qualified_type_name for MSVC (#31313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31313

This is a bugfix. The reason we couldn't enable the constexpr-ness for it before is that it was buggy,
and without constexpr it crashed at runtime rather than at compile time, which unfortunately slipped past our CI...
ghstack-source-id: 96380160

Test Plan: Now it works even when enabling constexpr for it

Differential Revision: D19087471

fbshipit-source-id: 28be107389f4507d35d08eab4b089a405690529b
2020-01-08 09:11:10 -08:00
2a294aace6 Remove memory ordering from LeftRight (#31026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31026

This is error prone and probably wrong. Since we don't use LeftRight on the hot path anymore, let's remove this.
ghstack-source-id: 96369644

Test Plan: none

Differential Revision: D18902165

fbshipit-source-id: 7b9478cd7cc071f403d75da20c7c889c27248b5c
2020-01-08 08:59:30 -08:00
84dfa96f62 Fix -Wundef warning in conversions.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31911

Test Plan:
* CI builds including GPU and OSS-build tests
* The `defined(__HIP_DEVICE_COMPILE__)` instance a few lines below is proof that this is a define/undef flag, not a define01 flag

Reviewed By: hlu1

Differential Revision: D19296560

fbshipit-source-id: 1c45069aec534b0bf4a87751a74680675c985e06
2020-01-08 08:39:37 -08:00
ee817012b2 Add more tests to the autograd wrt view and inplace (#31147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31147

The goal here is to add more tests of the current behavior of the autograd to make sure no regressions are introduced when modifying it.
Do let me know if you think of other corner cases I missed.

Test Plan: Imported from OSS

Differential Revision: D19301082

Pulled By: albanD

fbshipit-source-id: 2cb07dcf99e56eb1f2c56a179796f2e6042d5a2d
2020-01-08 07:14:52 -08:00
6664703842 Implement backend-agnostic rpc._wait_all_workers() utility (#31888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31888

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers report their intent to synchronize to the leader.
- The leader also reports to itself when `_wait_all_workers()` is called.
- If all workers report their intent to proceed, the leader sends the command to everyone to proceed (a runnable toy simulation follows this list).
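
A toy simulation of this barrier using threads and queues in place of RPC (illustrative only, not the torch.distributed.rpc implementation):
```
import queue
import threading

workers = ["worker0", "worker1", "worker2"]
leader = sorted(workers)[0]                  # elect the first sorted name
intents = queue.Queue()
proceed = {w: threading.Event() for w in workers}

def wait_all_workers(name):
    intents.put(name)                        # everyone (leader included) reports intent
    if name == leader:
        for _ in workers:                    # leader waits for every report
            intents.get()
        for w in workers:                    # then tells everyone to proceed
            proceed[w].set()
    proceed[name].wait()
    print(f"{name} proceeding")

threads = [threading.Thread(target=wait_all_workers, args=(w,)) for w in workers]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
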
ghstack-source-id: 96386210

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_worker_id
```

# Stress runs
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10
```

Differential Revision: D19290954

fbshipit-source-id: cdb22203c2f27b5e0d0ad5b2d3b279d438c22dcf
2020-01-08 01:00:25 -08:00
9116f02beb Rename TORCH_DCHECK to TORCH_INTERNAL_ASSERT_DEBUG_ONLY (#31917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31917

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19301480

Pulled By: ezyang

fbshipit-source-id: fcce8868733965b9fbd326b4ec273135759df377
2020-01-07 17:28:47 -08:00
ab60cca488 Make c10::util::get_fully_qualified_type_name() backwards compatible with clang 4 (#31351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31351

Clang 4 needs the c10:: namespace specifier on fully_qualified_type_name_impl() to work correctly.

Also, let's add an error message for people using clang 3 and earlier; we don't support those compilers anymore, but before this PR they got a crappy message.
ghstack-source-id: 96380163

Test Plan: testinprod

Differential Revision: D19135587

fbshipit-source-id: c206b56240b36e5c207fb2b69c389bb39f1e62aa
2020-01-07 17:07:54 -08:00
0dca9c30ca constexpr typeid improvements (#31312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31312

ghstack-source-id: 96369343

Test Plan: unit tests

Differential Revision: D19087198

fbshipit-source-id: 7f9a7169f11973759b9ecabcc755c211d34e2742
2020-01-07 17:07:49 -08:00
c21f89970f Remove c++14-conditional constexpr (#30916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30916

These macros said "make it constexpr if we're in C++14". Since we're now always C++14, we can just say "constexpr" instead.
ghstack-source-id: 96369584

Test Plan: waitforsandcastle

Differential Revision: D18869635

fbshipit-source-id: f41751e4e26fad6214ec3a98db2d961315fd73ff
2020-01-07 16:40:11 -08:00
4daa3dedbe Fix IValue.isList
Summary: I think this was wrong before?

Test Plan: Not sure.

Reviewed By: IvanKobzarev

Differential Revision: D19221358

fbshipit-source-id: 27e675cac15dde29e026305f4b4e6cc774e15767
2020-01-07 16:33:36 -08:00
1b4d3d5748 Properly return data from non-contiguous tensors in Java
Summary:
These were returning incorrect data before.  Now we make a contiguous copy
before converting to Java.  Exposing raw data to the user might be faster in
some cases, but it's not clear that it's worth the complexity and code size.

Test Plan: New unit test.

Reviewed By: IvanKobzarev

Differential Revision: D19221361

fbshipit-source-id: 22ecdad252c8fd968f833a2be5897c5ae483700c
2020-01-07 16:33:31 -08:00
2d6a2c898c Support tensors with a storage offset in Java (#31584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31584

These were returning incorrect data before.

Test Plan: New unit test.

Reviewed By: IvanKobzarev

Differential Revision: D19221360

fbshipit-source-id: b3f01de086857027f8e952a1c739f60814a57acd
2020-01-07 16:33:26 -08:00
6d1fa8296b Support tensors with empty shape in Java
Summary: These are valid tensors.

Test Plan: New unit test.

Reviewed By: IvanKobzarev

Differential Revision: D19221362

fbshipit-source-id: fa9af2fc539eb7381627b3d473241a89859ef2ba
2020-01-07 16:33:21 -08:00
3c07eb33bb Better error for torch::jit::loading an eager file (#31709)
Summary:
This adds a check to catch the case where someone `torch.save`s something then `torch::jit::load`s it in C++.

Relevant for #31620
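
For context, a sketch of the two serialization paths (file names illustrative):
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Loadable from C++ via torch::jit::load:
torch.jit.script(M()).save("m.pt")

# A plain pickle; torch::jit::load now gives a clear error for this:
torch.save(M().state_dict(), "weights.pt")
```
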
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31709

Pulled By: driazati

Differential Revision: D19252172

fbshipit-source-id: f2a9b4442647285418b2778306629b4ff77c15e5
2020-01-07 16:20:42 -08:00
a730920a3d Make RRef leak detection always print a warning log (#31922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31922

For better debugging of the `test_rref_leak` failure in https://app.circleci.com/jobs/github/pytorch/pytorch/4135881, as per the discussion in https://github.com/pytorch/pytorch/pull/31888.

ghstack-source-id: 96375261

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak
```

# Stress runs
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10
```

Differential Revision: D19302814

fbshipit-source-id: 51632aede98e01689f8bc0f266788a9b020daa15
2020-01-07 15:18:00 -08:00
227d1a43a4 Revert D18838848: disable __torch_function__ overrides for operators in torch.functional
Test Plan: revert-hammer

Differential Revision:
D18838848

Original commit changeset: 22b8015d7b2f

fbshipit-source-id: fdaeffcd112990ed379782cf7216d3f1beeb2cb1
2020-01-07 15:03:15 -08:00
8a0503b355 Run a non-quiet submodule update to prevent timeouts on Circle CI (#31900)
Summary:
As in the title, this PR disables the `--quiet` flag used in the CI as a workaround for a timeout hitting the macOS CI. Circle CI times out when no text has been printed for 10 minutes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31900

Differential Revision: D19302899

Pulled By: bwasti

fbshipit-source-id: 145647da983ee06f40794bda1abd580ea45a0019
2020-01-07 14:01:05 -08:00
114562cf93 For torch::from_blob() add clue when memory is non-owned. (#31222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31222

 - When constructing torch::from_blob() in the case where the deleter is a nop, switch to using a nullptr context in the DataPtr (with a nop deleter)

 - No real extra memory/cpu requirements here, actually saves a minor alloc.

Why? Trying to get a signal that a Tensor might contain non-owned memory from
torch::from_blob(), by detecting the nullptr context.
ghstack-source-id: 96336078

Test Plan:
buck test mode/dev caffe2/test/cpp/api/...
buck test mode/dev-nosan caffe2/test/...

Differential Revision: D18992119

fbshipit-source-id: 4eea642f82d0858b57fdfc6995364a760c10567d
2020-01-07 13:12:30 -08:00
ca72df06ae disable __torch_function__ overrides for operators in torch.functional (#30839)
Summary:
For now I'm just removing the decorators from all of the currently overridable functions in `torch.functional`. This means they are no longer overridable, however this should fix the benchmark regressions reported in https://github.com/pytorch/pytorch/issues/30831. Moving forward we'll be looking at reducing the overhead of the python-level override mechanism and failing that, re-implementing all of these operators in C++.

cc hl475
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30839

Differential Revision: D18838848

Pulled By: ezyang

fbshipit-source-id: 22b8015d7b2f7a947f1ebc9632c998e081b48ad8
2020-01-07 12:27:28 -08:00
bb279c5c63 named tensor max pooling support
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31669

Test Plan: Imported from OSS

Differential Revision: D19240348

Pulled By: glaringlee

fbshipit-source-id: 004387aa753e4e41afdede66647abbb0bcbd9808
2020-01-07 12:03:18 -08:00
3a2757c682 Fix tracing for modules with List[Tensor] as output (#31343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31343

Fix an issue in TorchScript tracing for modules with `c10::List<at::Tensor>` as an output. TensorList was not supported properly.

Test Plan: unit tests

Reviewed By: wanchaol

Differential Revision: D18850722

fbshipit-source-id: 87a223104d1361fe754d55deceeb1e8bbcad629b
2020-01-07 11:57:25 -08:00
74d69e296e Raise an error if torch.cat is given out as one of the input tensors (#30577)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30562 for both cpu and cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30577

Differential Revision: D19298732

Pulled By: ezyang

fbshipit-source-id: ea539c97493ee17d8f60b1134d100a44c8717578
2020-01-07 11:30:33 -08:00
c888473b57 Restructure docs organization and naming (#31849)
Summary:
* Rename “Other Languages” → “Language Bindings”
* Move the Community section to the bottom
* Move "Language Bindings" above "Python API"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31849

Differential Revision: D19290966

Pulled By: jlin27

fbshipit-source-id: 30b579e032a9fb1636e4afc7bbbd85a2708f637d
2020-01-07 11:16:53 -08:00
bf8e1c0710 Integrate async mode for autograd engine with distributed autograd. (#31508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31508

This PR builds on top of https://github.com/pytorch/pytorch/pull/31230
to ensure that distributed autograd doesn't block an RPC thread anymore during
the backward pass.

I've also added a unit test where all ranks hammer rank 0 with about 60
backward calls (which would have caused a deadlock earlier); now such a test
passes without any issues.
ghstack-source-id: 96345097

Test Plan: waitforbuildbot

Differential Revision: D19188749

fbshipit-source-id: b21381b38175699afd0f9dce1ddc8ea6a220f589
2020-01-07 11:01:16 -08:00
0e5a6700cc Emit warning from deprecated torch function signatures (#31514)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/28430

The unpythonic signatures for functions such as `torch.addcdiv` are already separated in [`deprecated.yaml`] and the signatures marked as deprecated in `PythonArgParser`. However, nothing was done with this information previously. So, this now emits a warning when the deprecated signatures are used.

One minor complication is that if all arguments are passed as keyword args then there is nothing to differentiate the deprecated overload. This can lead to false warnings being emitted. So, I've also modified `PythonArgParser` to prefer non-deprecated signatures.

[`deprecated.yaml`]: https://github.com/pytorch/pytorch/blob/master/tools/autograd/deprecated.yaml
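
As a hedged illustration of the new behavior (assuming a build that includes this change), the deprecated `addcdiv` overload with a positional `value` now warns:
```
import warnings

import torch

t, t1 = torch.randn(3), torch.randn(3)
t2 = torch.rand(3) + 1          # keep the divisor away from zero

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    torch.addcdiv(t, 0.5, t1, t2)        # deprecated: value passed positionally
    print(caught[0].message if caught else "no warning")

torch.addcdiv(t, t1, t2, value=0.5)      # preferred signature, no warning
```
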
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31514

Differential Revision: D19298735

Pulled By: ezyang

fbshipit-source-id: 03cb78af17658eaab9d577cd2497c6f413f07647
2020-01-07 10:57:53 -08:00
5cc62f2913 Ensure autograd callbacks are called only once for reentrant backward. (#31909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31909

https://github.com/pytorch/pytorch/pull/31230 introduced a bug where
we would end up calling `graph_task_post_processing` twice for reentrant
backward calls (once when we mark the future completed and then again when we
called graph_task_post_processing in execute_with_graph_task).

This PR fixes the issue by verifying that the future we return in that case is
completed and we remove the call to graph_task_post_processing.

In addition to that I added a test that reproduced the problem and verified it
is fixed by this PR.
ghstack-source-id: 96349102

Test Plan: waitforbuildbot

Differential Revision: D19296363

fbshipit-source-id: dc01a4e95989709ad163bb0357b1d191ef5a4fb2
2020-01-07 10:35:04 -08:00
4ee9c56218 Support PyTorch ROCm CI on Ubuntu18.04 (#31886)
Summary:
In order to support Ubuntu18.04, some changes to the scripts are required.
* install dependencies with -y flag
* mark install noninteractive
* install some required dependencies (gpg-agent, python3-distutils, libidn11)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31886

Differential Revision: D19300586

Pulled By: bddppq

fbshipit-source-id: d7fb815a3845697ce63af191a5bc449d661ff1de
2020-01-07 10:32:47 -08:00
2f5eefe525 Raise ValueError if CUDA device is specified without specifying the : (#29087)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29087

Differential Revision: D19298959

Pulled By: ezyang

fbshipit-source-id: 878ea4840682012f07177d8d159a77c0e5afada6
2020-01-07 10:29:49 -08:00
3c7db5ccbc Don't unconditionally compile runJITCPPTests (#31236)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31236

It is not compiled on Windows

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19262581

Pulled By: ezyang

fbshipit-source-id: 80bfa553333a946f00291aaca6ad26313caaa9e6
2020-01-07 10:24:52 -08:00
809ee9d04c Enable personalized FC weight_init and sparse_emb weight_init (#31707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31707

Change the initialization value for FC weight init and sparse embedding lookup init.

The previous default initialization was uniform(-\sqrt(1/input_dim), \sqrt(1/input_dim)); now we pass in a flexible hyperparameter \alpha, changing it to uniform(-\sqrt(\alpha/input_dim), \sqrt(\alpha/input_dim)).
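
A sketch of the parameterized init in numpy (names are hypothetical; alpha = 1.0 reproduces the old default):
```
import numpy as np

def fc_weight_init(input_dim, output_dim, alpha=1.0):
    # uniform(-sqrt(alpha/input_dim), sqrt(alpha/input_dim))
    bound = np.sqrt(alpha / input_dim)
    return np.random.uniform(-bound, bound, size=(output_dim, input_dim))

w = fc_weight_init(input_dim=128, output_dim=64, alpha=2.0)
print(w.min(), w.max())   # both within +/- sqrt(2/128)
```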

Reviewed By: chonglinsun

Differential Revision: D18825615

fbshipit-source-id: 4c5f2e07f2b3f5d642fd96d64dbf68892ebeb30b
2020-01-07 10:10:54 -08:00
22044c6f7c Use TORCH_CHECK instead of AT_ASSERT in torch::cuda::gather() (#27456)
Summary:
The error message produced by AT_ASSERT() in gather() encouraged users to file a bug report ("please report a bug to PyTorch..."). The assertion should be a regular argument check since it can be triggered by passing tensors with different dimensionality, e.g. `torch.cuda.comm.gather([torch.rand(1, device='cuda'), torch.rand(1, 1, device='cuda')])`.

See: https://github.com/pytorch/pytorch/issues/26400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27456

Differential Revision: D19300270

Pulled By: ezyang

fbshipit-source-id: ec87d225e23445020b377521e0daccceb4748215
2020-01-07 10:04:24 -08:00
20c5dd59bd Add stub for transformer.py and MultiheadAttention Class. (#28396)
Summary:
Add stub for `transformer.py` and `class MultiheadAttention`. Add import for `transformer.py` and `class MultiheadAttention` in `__init__.pyi.in`. I've tested the code hints in PyCharm and all works fine.
Relate issue: [https://github.com/pytorch/pytorch/issues/27842](https://github.com/pytorch/pytorch/issues/27842)
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28396

Differential Revision: D19300287

Pulled By: ezyang

fbshipit-source-id: 1a79d6518b5edd4643892c46a959108385c739ad
2020-01-07 09:13:36 -08:00
346a349111 Update all instances of 1.4.0 -> 1.5.0 (#31785)
Summary:
Done with:

```
❯ sed -i 's/1\.4\.0/1.5.0/g' $(find -type f -not -path "./third_party/*")
```

This was previously done in separate commits, but it would be beneficial to bump all included projects within this repository at the same time.

Old bumps for reference:
* [iOS]Update Cocoapods to 1.4.0: https://github.com/pytorch/pytorch/pull/30326
* [android] Change nightly builds version to 1.4.0-SNAPSHOT: https://github.com/pytorch/pytorch/pull/27381
* Roll master to 1.4.0: https://github.com/pytorch/pytorch/pull/27374

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31785

Differential Revision: D19277925

Pulled By: seemethere

fbshipit-source-id: f72ad082f0566004858c9374879f4b1bee169f9c
2020-01-07 08:00:17 -08:00
985fd970aa Enable BFloat16 support for Convolutions on ROCm (#30948)
Summary:
This PR adds bfloat16 support for convolutions on ROCm.

- Integrates MIOpen bfloat16 convolution support into PyTorch

- Enables bfloat16 convolution for non-miopen paths, i.e THCUNN, native hip kernels

- Enables bfloat16 type for probability distribution functions (this is included in this PR since conv unit tests use bfloat16 random number generators)

Native cuda kernels for convolution and random functions will be compiled for CUDA as well.

iotamudelta bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30948

Differential Revision: D19274164

Pulled By: ezyang

fbshipit-source-id: c0888a6ac72a2c5749b1ebb2195ac6f2209996be
2020-01-07 06:57:35 -08:00
a561a8448b minor doc tweak to use mp.spawn in example (#30381)
Summary:
Per pietern's comment in https://github.com/pytorch/pytorch/issues/30022, we can make this example launcher a bit simpler by using `torch.multiprocessing`.
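
Roughly the kind of simplification suggested (a sketch; `run_worker` is a placeholder):
```
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # mp.spawn passes the process rank as the first argument
    print(f"worker {rank} of {world_size}")

if __name__ == "__main__":
    world_size = 4
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
```
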
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30381

Differential Revision: D19292080

Pulled By: rohan-varma

fbshipit-source-id: 018ace945601166ef3af05d8c3e69d900bd77c3b
2020-01-06 22:19:01 -08:00
34561dadcd Don't handle bias inside cudnn_convolution* (#31524)
Summary:
Compared to cuDNN bias, PyTorch's add has the following advantages:
- faster, especially for backward (see: https://github.com/zasdfgbnm/things/blob/master/2019/conv-backward-profile.md)
- handles 64bit indexing automatically
- has less code, less maintenance effort

ngimel, I'm submitting this PR early so the CI can start building it. But I have not tested it locally yet (still waiting for it to compile).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31524

Differential Revision: D19264244

Pulled By: ngimel

fbshipit-source-id: cb483d378a6d8bce0a05c3643a796e544bd8e8f0
2020-01-06 16:47:54 -08:00
5d80f63478 no_grad, enable_grad: support for decorating generator functions (#31792)
Summary:
Closes https://github.com/pytorch/pytorch/issues/31497

This allows `torch.no_grad` and `torch.enable_grad` to be used as decorators for generator functions, in which case they disable/enable grad only inside the body of the generator and restore the context outside of it.

https://github.com/pytorch/pytorch/issues/31497 doesn't include a complete reproducer, but the included test with `torch.is_grad_enabled` shows this working where it failed before.
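
A minimal sketch of the new behavior:
```
import torch

@torch.no_grad()
def gen():
    while True:
        yield torch.is_grad_enabled()    # grad is disabled inside the generator

g = gen()
print(next(g))                  # False
print(torch.is_grad_enabled())  # True: the caller's mode is restored outside
```
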
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31792

Differential Revision: D19274971

Pulled By: albanD

fbshipit-source-id: fde6d3fd95d76c8d324ad02db577213a4b68ccbe
2020-01-06 15:21:20 -08:00
58cffbff91 Add missing TORCH_CUDA_API annotation to throw_nccl_error (#31157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31157

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19262583

Pulled By: ezyang

fbshipit-source-id: 8fb87b41ab53770329b38e1e2fe679fb868fee12
2020-01-06 14:39:51 -08:00
4ef9daf7b2 Remove dead CAFFE2_LIBS variable (#31155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31155

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19262584

Pulled By: ezyang

fbshipit-source-id: 147ac5a9c36e813ea9a2f68b498880942d661be5
2020-01-06 14:39:47 -08:00
a9dae70bae Remove LibIRC logic from cmake. (#31152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31152

Per apaszke: I can't find any reasonable references to libIRC online, so
I decided to remove this.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19262582

Pulled By: ezyang

fbshipit-source-id: a1d47462427a3e0ca469062321d608e0badf8548
2020-01-06 14:39:43 -08:00
112196fdee Fix index put (#31552)
Summary:
This change is required for cases like:
x[1:] = data or x[:3] = data
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31552

Reviewed By: hl475

Differential Revision: D19238815

Pulled By: houseroad

fbshipit-source-id: 56c9837d86b341ea92b0a71d55034ce189d12e6c
2020-01-06 14:09:48 -08:00
78cba90a8c Enable constant folding for Reshape (#31054)
Summary:
Enabled constant folding for onnx::Reshape
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31054

Reviewed By: hl475

Differential Revision: D18946951

Pulled By: houseroad

fbshipit-source-id: 499e8bf5fb091a94f7a27cbdf4311a23b1a6e3d3
2020-01-06 13:35:44 -08:00
492ca46e71 Fix androidTest - exclude host tests from it
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31522

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D19200861

Pulled By: IvanKobzarev

fbshipit-source-id: a6024f3013398f9e0d237e06c984a20493d42f11
2020-01-06 11:29:46 -08:00
c65305e991 Add a check method for custom type tensor (#31290)
Summary:
For backend integration, a backend (e.g. Glow) needs to check the content of the tensor to determine whether it is a legit byte tensor or some special packed format. This provides a convenient interface for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31290

Reviewed By: jackm321, qizzzh

Differential Revision: D19069684

Pulled By: yinghai

fbshipit-source-id: 63360fa2c4d32695fe9767a40027d446d63efdd4
2020-01-06 11:15:33 -08:00
1f2b6d632a Refactor tests in pytorch's test/dist_autograd_test.py file (#31803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31803

Refactored the following fairly similar functions:
  1. `test_context_cleanup_tensor_with_grad`
  2. `test_context_cleanup_tensor_no_grad`
  3. `test_context_cleanup_no_tensors`
by creating a helper function `context_cleanup_test_helper` that can be invoked with the appropriate arguments.

Test Plan: Verified by running tests.

Differential Revision: D19269246

fbshipit-source-id: bfb42b078ad56b97ceeecf0d68b4169768c2c453
2020-01-06 10:59:00 -08:00
ddff014b79 fixed scale_factor calculation for uint8 tensor (#31778)
Summary:
When calling the add_images() method on the tensorboard SummaryWriter with a uint8 NCHW tensor, the tensor is incorrectly scaled, resulting in overflow behavior. This leads to incorrect images being displayed in tensorboard.

Issue: https://github.com/pytorch/pytorch/issues/31459

Local testing (ran this code with and without the PR changes and printed scale_factor):

```
import torch
import torchvision
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
x = torch.tensor([[[[1, 2, 3], [4, 5, 6]]]], dtype=torch.uint8)
writer.add_images("images", x)
```

Before: scale_factor = 255. After: scale_factor = 1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31778

Differential Revision: D19289189

Pulled By: anjali411

fbshipit-source-id: 350a1650337244deae4fd8f8b7fb0e354ae6986b
2020-01-06 10:27:35 -08:00
1ba1799a66 C++ added 3rd arg of false to BatchNorm/InstanceNorm register_parameter … (#31873)
Summary:
Fix for issue https://github.com/pytorch/pytorch/issues/31680
C++ BatchNorm & InstanceNorm attempt to register undefined tensors when affine is false.

Fixes https://github.com/pytorch/pytorch/issues/31680
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31873

Differential Revision: D19287087

Pulled By: yf225

fbshipit-source-id: 0d57f10c49083386919b703d72b520a73a8e9e7f
2020-01-06 01:46:24 -08:00
33430cf094 Revert D18643137: Implement backend-agnostic rpc._wait_all_workers() utility
Test Plan: revert-hammer

Differential Revision:
D18643137

Original commit changeset: d669d4fc9ad6

fbshipit-source-id: fe1f8ed77c1c5760638fef06e67ba100b86c33e9
2020-01-05 11:58:51 -08:00
fde94e7556 Provide async mode for local autograd engine. (#31230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31230

A major issue with distributed autograd currently is that we block an
RPC thread when we call Engine::execute_with_graph_task.

To resolve this issue, I've made modifications to the local autograd engine
such that `execute_with_graph_task` returns a Future instead. Engine::execute()
and DistEngine::execute() still wait() on this Future, which ensures there is no
change in behavior yet.

In follow up PRs we can modify the distributed autograd engine to take
advantage of this Future.

Closes #26359
ghstack-source-id: 96298057

Test Plan: waitforbuildbot

Differential Revision: D18999709

fbshipit-source-id: 388f54467fd2415a0acb7df17bd063aedc105229
2020-01-05 00:29:28 -08:00
3f0b330736 corrected keyword argument name in docs for Tensor.scatter (#31617)
Summary:
See https://github.com/pytorch/pytorch/issues/31601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31617

Differential Revision: D19268872

Pulled By: mruberry

fbshipit-source-id: 52f0213f4aab991fd549b7623556a2ced61631a6
2020-01-04 21:48:30 -08:00
9020d30fc9 Updating submodules
Summary:
GitHub commits:

d7f0e32081
f2a603d2df
323a2bc3e5
04c07965ef
c179d38294
6fac956f22

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 558f35dbf1adb3b45179629c61d77488e441d4e3
2020-01-04 21:43:31 -08:00
502533cfe6 Implement backend-agnostic rpc._wait_all_workers() utility (#30710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30710

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers report their intent to synchronize to the leader.
- The leader also reports to itself when `_wait_all_workers()` is called.
- If all workers report their intent to proceed, the leader sends the command to everyone to proceed.

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_wait_all_workers

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers$
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_forward_chain
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_wait_all_workers

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_wait_all_workers$
```

# Stress runs
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10
```

# Debug

```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_shutdown
```

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork -- test_clean_context_during_backward

buck build mode/dev-nosan //caffe2/test:dist_autograd_fork

buck-out/gen/caffe2/test/dist_autograd_fork\#binary.par -r test_clean_context_during_backward
```

https://our.intern.facebook.com/intern/testinfra/diagnostics/281475127895800.844424945328750.1575664368/

```
I1206 12:27:47.491420 185619 process_group_agent.cpp:211] Shutting down ProcessGroupAgent.
I1206 12:27:47.493880 185630 process_group_agent.cpp:211] Shutting down ProcessGroupAgent.
I1206 12:27:47.494526 185625 process_group_agent.cpp:211] Shutting down ProcessGroupAgent.
I1206 12:27:47.495390 185636 process_group_agent.cpp:211] Shutting down ProcessGroupAgent.
E1206 12:27:47.544198 185627 pair.cc:642] 1 --->>> 0, read ERROR: AsyncSocketException: Network error, type = Network error, errno = 104 (Connection reset by peer)
E1206 12:27:47.544203 185633 pair.cc:642] 2 --->>> 0, read ERROR: AsyncSocketException: Network error, type = Network error, errno = 104 (Connection reset by peer)
E1206 12:27:47.544210 185639 pair.cc:642] 3 --->>> 0, read ERROR: AsyncSocketException: Network error, type = Network error, errno = 104 (Connection reset by peer)
```
This should mean the UDF in the request has been run, so Python proceeded and ran to `_agent.shutdown()`.

The RpcAgents on the followers then tried to send back their responses, but the leader had already closed RPC.

Need to re-trigger "pytorch_rpc-buck" to reproduce the rare-seen issue.

Differential Revision: D18643137

fbshipit-source-id: d669d4fc9ad65ed48bed1329a4eb1c32ba51323c
2020-01-04 17:13:44 -08:00
f362cd510d Move prim ops from JIT registration to C10 (#30612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30612

The first version to move prim ops to c10 registration. After the reviewers are fine with the initial changes, more operators will be moved in the same style.

Test Plan: Imported from OSS

Differential Revision: D19237648

Pulled By: iseeyuan

fbshipit-source-id: c5a519604efffb80564a556536f17d829f71d9f9
2020-01-04 13:47:44 -08:00
5579611544 Enable foldbn tests (#29220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29220

Support for accessing constant is added in previous
PRs, this PR re-enables the foldbn tests

Test Plan:
test_jit.py

Imported from OSS

Differential Revision: D18846848

fbshipit-source-id: 90ceaf42539ffee80b984e0d8b2420da66c263c3
2020-01-04 11:47:01 -08:00
ebe69236d1 Expose class constant through attr and setattr in object (#29219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29219

We added class constant in previous PRs, this PR allows access to
class constant in the object API

Test Plan:
build/bin/test_jit
python test/test_jit.py

Imported from OSS

Differential Revision: D18846851

fbshipit-source-id: 888a6517d5f747d1f8ced283c0c2c30b2f6c72c6
2020-01-04 11:09:35 -08:00
6f62c311a1 Add unsafeRemoveConstant for ClassType (#30787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30787

This is needed when we fuse conv bn modules,
where we need to rewrite a constant bias (None) of conv to an attribute
bias of Tensor

Test Plan:
build/bin/test_jit

Imported from OSS

Differential Revision: D18846850

fbshipit-source-id: 9fd5fe85d93d07226e180b75d2e068fe00ca25fe
2020-01-04 01:11:59 -08:00
2bac76969c Fix getConstant (#31012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31012

- getConstant should throw when the item is not found
- add another getConstant which takes slot index as argument

Test Plan:
test_class_type.cpp

Imported from OSS

Differential Revision: D18898418

fbshipit-source-id: d3a23a4896fdbf5fa98e1c55c9c4d6205840014b
2020-01-03 23:06:11 -08:00
8420f205ee Remove refs from ArrayRef arguments (#31845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31845

ArrayRef is trivially copyable and should be passed by value. Removing
unnecessary `&`s.

Test Plan: Imported from OSS

Differential Revision: D19278523

Pulled By: suo

fbshipit-source-id: 026db693ea98d19246b02c48d49d1929ecb6478e
2020-01-03 22:50:55 -08:00
b0a2765103 move docker image html to correct bucket (#31832)
Summary:
save docker image version to docker.pytorch.org bucket to be served with http://docker.pytorch.org

test result: https://s3.amazonaws.com/docker.pytorch.org/pytorch.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31832

Differential Revision: D19281263

Pulled By: mingbowan

fbshipit-source-id: d906a72d419876c81a570a2086b2d8d2c47d5d17
2020-01-03 21:38:58 -08:00
5fe3604987 Preserve constant from ConcreteModuleType to ClassType (#29218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29218

We need to be able to access constant in module.

Test Plan:
tbd

Imported from OSS

Differential Revision: D18846847

fbshipit-source-id: 22d2c485c3c449bc14ad798f6e1a0c64fc8fb346
2020-01-03 21:30:04 -08:00
e5b7231edc Adding version check for hypothesis deadline
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31262

Test Plan: Imported from OSS

Differential Revision: D19036700

Pulled By: z-a-f

fbshipit-source-id: 8e898a6f064dfb4876aa0d3cc299288b5af7b37d
2020-01-03 19:17:55 -08:00
28c9dd4436 fix ProcessGroupGlooTest (#31255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31255

This test had 2 issues: it would occasionally time out due to a 50ms timeout, and CUDA code would get compiled and run on CPU, leading to errors. This PR fixes both issues.

Differential Revision: D19028231

fbshipit-source-id: e50752228affe0021e7c0caa83bce78d76473759
2020-01-03 18:35:29 -08:00
27488773b0 Updating submodules
Summary:
GitHub commits:

8c7c0e201e
b84db9a971
0524fa0b36
2df7b2ba54
80553514ed
4eb66bc7aa

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 97d0605beabcfc15236038215208acf034f8eba4
2020-01-03 17:04:54 -08:00
c829c6f3d2 Disable flaky test_debug_info
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31847

Test Plan: Imported from OSS

Differential Revision: D19278009

Pulled By: mrshenli

fbshipit-source-id: 652fa6741a48f35d9f8f54534e84d64fdd96b439
2020-01-03 17:01:27 -08:00
6b1db202bc Add tanh to c10::cuda::compat (#31844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31844

Add tanh to c10::cuda::compat

Test Plan: unittest

Reviewed By: bddppq

Differential Revision: D19277230

fbshipit-source-id: d2cceea58722393ecb90aacec05b692dbb92d467
2020-01-03 14:27:36 -08:00
9407137102 Update the descriptive error message for enforce fail (#31575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31575

We need a new exception class specifically for the enforce_finite operator, because we need to map it to a specific python exception ExitException, not the RuntimeError type that all c10::Errors get mapped to by default. This diff includes:
- Define c10::EnforceFiniteNotMet
- API CAFFE_ENFORCE_FINITE to throw c10::EnforceFiniteNotMet
- Map from c10::EnforceFiniteNotMet to python ExitException
- Apply CAFFE_ENFORCE_FINITE in caffe2 op

Test Plan:
- integration test pass: https://fburl.com/fblearner/xwkzbqyo
- integration test with D19213617: https://fburl.com/fblearner/479y4jrj Generate error message as desired

- Example:
  - Original error message  f157597803
{F225477055}

  - Updated error message  (with D19213617 to generate the error): f158571327
{F225477071}

Reviewed By: zheng-xq

Differential Revision: D19206240

fbshipit-source-id: bd256862801d5957a26b76d738edf4e531f03827
2020-01-03 13:53:20 -08:00
40e720282c Using _floats_wrapper in per_channel_tensor generation (#31780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31780

We need to specify the width to ensure the generated float is representable by `float32`.
Fixes: https://github.com/pytorch/pytorch/issues/31774
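For context, a minimal sketch of the idea using hypothesis's public API (`_floats_wrapper` itself is an internal test helper):

```python
from hypothesis import strategies as st

# width=32 restricts generation to values exactly representable as
# float32, so round-tripping through a float32 tensor cannot change them.
f32 = st.floats(min_value=-10.0, max_value=10.0, width=32)
```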

Test Plan:
ci

Imported from OSS

Differential Revision: D19275165

fbshipit-source-id: 50560b4208c562b6bcd2abccadd234f29fbb4b0a
2020-01-03 13:40:08 -08:00
86a4e2135d Do not register const float * type on utiliy_ops.cu (#31583)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31583

But rather use `float *`, which is already registered

Test Plan: CI

Reviewed By: xianjiec

Differential Revision: D19221405

fbshipit-source-id: eb8eabcf828745022bc1e4185a0e65abd19a8f04
2020-01-03 13:28:26 -08:00
457c57d9f7 use unordered_set instead of vector for futureTimeouts key in (#31813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31813

Closes https://github.com/pytorch/pytorch/issues/31804. We were using
an `std::vector` for the key for a map that keeps track of futures to mark them
if they timeout, but we can instead use an `unordered_set`. This results in a
faster lookup in the code block where we remove futureIDs from this set when
they complete successfully. Previously we were finding them via a linear
`std::find`. Switching it to a constant time find will help performance in the
case where a large number of futures are scheduled to time out at the same
time, or if there is no timeout enforced.

To benchmark a rough perf improvement, I created 50k futures with the same
timeout. Before this PR, the lookup `std::find(futuresAtTime.begin(),
futuresAtTime.end(), id)` took ~200us, now it takes 1us.
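As a rough plain-Python illustration of why the lookup gets cheaper (not the actual C++ code):

```python
import timeit

ids_list = list(range(50_000))
ids_set = set(ids_list)

# Membership test: linear scan in a list vs. hashed lookup in a set.
print(timeit.timeit(lambda: 49_999 in ids_list, number=100))
print(timeit.timeit(lambda: 49_999 in ids_set, number=100))
```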
ghstack-source-id: 96251355

Test Plan: Unit tests pass.

Differential Revision: D19269798

fbshipit-source-id: 1a0fa84a478ee27a16ab0b9fa6f5413b065a663e
2020-01-03 13:21:23 -08:00
b44c0f328e Skip same tests in ONNX Python3 CI as in Python2 (#31827)
Summary:
resolve https://github.com/pytorch/pytorch/issues/31103

vgg models were not tested in Python2 but are turned on in Python3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31827

Reviewed By: houseroad

Differential Revision: D19274123

Pulled By: bddppq

fbshipit-source-id: c48beb574e8b03b2adbd6c9d8ca3f600bee93024
2020-01-03 12:42:42 -08:00
79e30ff3f8 optimize index_select performance on CPU with TensorIterator (#30598)
Summary:
This PR aims to improve `index_select` performance on CPU with `TensorIterator`.
The optimization is equally effective for contiguous and non-contiguous tensors.
The code parallelizes the inner loop when the copied slice is large enough; otherwise it parallelizes the outer loop.
Thus both user scenarios, DLRM (via `Embedding`) and the Fairseq transformer, are covered; a quick usage sketch follows the numbers below.

1. for contiguous input, single socket: **1.25x** performance speedup
2. for non-contiguous input, single socket: **799x** performance speedup
3. for contiguous input, single core: same performance
4. for non-contiguous input, single core: **31x** performance speedup
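For reference, a minimal usage sketch of the operator being optimized (shapes chosen for illustration):

```python
import torch

x = torch.randn(1000, 128)              # contiguous input
idx = torch.randint(0, 1000, (256,))
out = torch.index_select(x, 0, idx)     # gathers 256 rows -> (256, 128)

xt = x.t()                              # non-contiguous input
out_t = torch.index_select(xt, 1, idx)  # -> (128, 256)
```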
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30598

Differential Revision: D19266892

Pulled By: VitalyFedyunin

fbshipit-source-id: 7aaf8e2c861b4a96250c968c4dd95c8d2c5b92d7
2020-01-03 11:59:43 -08:00
0ae063d5d9 Fixed concatenation benchmark + added it to the microbenchmarking runs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31587

Test Plan: Imported from OSS

Differential Revision: D19221813

Pulled By: z-a-f

fbshipit-source-id: ee0eb60da7899b23fdc63326302d1e2fd4b540ee
2020-01-03 11:23:12 -08:00
9c9d3cd550 Revert D19262570: Fix race condition when creating build dir
Test Plan: revert-hammer

Differential Revision:
D19262570

Original commit changeset: bb18c72e4264

fbshipit-source-id: 40675ef6ef4c98629deaaef0b25956f92534ff50
2020-01-03 11:17:42 -08:00
a02a5129a8 Move rrelu to Aten(CPU) (#31094)
Summary:
VitalyFedyunin, this PR ports the rrelu activation to ATen.
Test script:
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    return time.time()

device = "cpu"
m = nn.RReLU(0.1, 0.3).train()
# for inference
#m = nn.RReLU(0.1, 0.3).eval()
#warm up
for n in [1, 10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.randn(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [1, 10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.randn(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
**Before:**
```
Training:
input size(128, 1) forward time is 0.01 (ms); backward avg time is 0.03 (ms).
input size(128, 10) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 100) forward time is 0.17 (ms); backward avg time is 0.06 (ms).
input size(128, 1000) forward time is 1.45 (ms); backward avg time is 0.07 (ms).
inference:
input size(128, 1) forward time is 0.01 (ms).
input size(128, 10) forward time is 0.01 (ms).
input size(128, 100) forward time is 0.02 (ms).
input size(128, 1000) forward time is 0.15 (ms).
```
**After:**
```
Training:
input size(128, 1) forward time is 0.01 (ms); backward avg time is 0.03 (ms).
input size(128, 10) forward time is 0.03 (ms); backward avg time is 0.04 (ms).
input size(128, 100) forward time is 0.17 (ms); backward avg time is 0.07 (ms).
input size(128, 1000) forward time is 1.43 (ms); backward avg time is 0.08 (ms).
inference:
input size(128, 1) forward time is 0.02 (ms).
input size(128, 10) forward time is 0.02 (ms).
input size(128, 100) forward time is 0.02 (ms).
input size(128, 1000) forward time is 0.03 (ms).
```
**OMP_NUM_THREADS=1:**
```
Before:
Training:
input size(128, 1) forward time is 0.01 (ms); backward avg time is 0.02 (ms).
input size(128, 10) forward time is 0.02 (ms); backward avg time is 0.02 (ms).
input size(128, 100) forward time is 0.15 (ms); backward avg time is 0.03 (ms).
input size(128, 1000) forward time is 1.45 (ms); backward avg time is 0.14 (ms).
inference:
input size(128, 1) forward time is 0.01 (ms).
input size(128, 10) forward time is 0.01 (ms).
input size(128, 100) forward time is 0.02 (ms).
input size(128, 1000) forward time is 0.20 (ms).

After:
Training:
input size(128, 1) forward time is 0.01 (ms); backward avg time is 0.02 (ms).
input size(128, 10) forward time is 0.02 (ms); backward avg time is 0.02 (ms).
input size(128, 100) forward time is 0.15 (ms); backward avg time is 0.03 (ms).
input size(128, 1000) forward time is 1.43 (ms); backward avg time is 0.15 (ms).
inference:
input size(128, 1) forward time is 0.01 (ms).
input size(128, 10) forward time is 0.02 (ms).
input size(128, 100) forward time is 0.02 (ms).
input size(128, 1000) forward time is 0.06 (ms).
```
Fix https://github.com/pytorch/pytorch/issues/24755, https://github.com/pytorch/pytorch/issues/24756.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31094

Differential Revision: D19270936

Pulled By: VitalyFedyunin

fbshipit-source-id: 11bb3236b1037a558022d3777d1f9a429af2bffe
2020-01-03 11:10:00 -08:00
b47e9b97a2 Add op bitwise_and (#31104)
Summary:
Following https://github.com/pytorch/pytorch/pull/25665, add the `bitwise_and` operator.
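For reference, a quick usage sketch of the new operator (output values worked out by hand):

```python
import torch

a = torch.tensor([0b1100, 0b1010], dtype=torch.uint8)
b = torch.tensor([0b1010, 0b0110], dtype=torch.uint8)
torch.bitwise_and(a, b)   # tensor([8, 2], dtype=torch.uint8)
a & b                     # __and__ dispatches to the same kernel
```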
Benchmark script:
```
import timeit
#for __and__
for n, t in [(10, 100000),(1000, 10000)]:
    print('__and__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a & b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}")', number=t))
#for __iand__
for n, t in [(10, 100000),(1000, 10000)]:
    print('__iand__ (a.numel() == {}) for {} times'.format(n, t))
    for device in ('cpu', 'cuda'):
        for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64'):
            print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a & b\nif "{device}" == "cuda": torch.cuda.synchronize()', setup=f'import torch; a = torch.randint(0, 10, ({n},), dtype = {dtype}, device="{device}"); b = torch.tensor(5, dtype = {dtype}, device="{device}")', number=t))
```
Device: **Tesla P100, skx-8180**
Cuda verison: **9.0.176**

Before:
```
__and__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.1766007635742426
device: cpu, dtype: torch.uint8, 100000 times           0.17322628945112228
device: cpu, dtype: torch.int16, 100000 times           0.17650844901800156
device: cpu, dtype: torch.int32, 100000 times           0.17711848113685846
device: cpu, dtype: torch.int64, 100000 times           0.18240160401910543
device: cuda, dtype: torch.int8, 100000 times           1.273967768996954
device: cuda, dtype: torch.uint8, 100000 times          1.2778537990525365
device: cuda, dtype: torch.int16, 100000 times          1.2753686187788844
device: cuda, dtype: torch.int32, 100000 times          1.2797665279358625
device: cuda, dtype: torch.int64, 100000 times          1.2933144550770521
__and__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.031139614060521126
device: cpu, dtype: torch.uint8, 10000 times            0.03091452084481716
device: cpu, dtype: torch.int16, 10000 times            0.022756479680538177
device: cpu, dtype: torch.int32, 10000 times            0.025045674294233322
device: cpu, dtype: torch.int64, 10000 times            0.024164282716810703
device: cuda, dtype: torch.int8, 10000 times            0.12820732593536377
device: cuda, dtype: torch.uint8, 10000 times           0.12775669433176517
device: cuda, dtype: torch.int16, 10000 times           0.12697868794202805
device: cuda, dtype: torch.int32, 10000 times           0.12832533661276102
device: cuda, dtype: torch.int64, 10000 times           0.1280576130375266
__iand__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.3687064303085208
device: cpu, dtype: torch.uint8, 100000 times           0.36253443732857704
device: cpu, dtype: torch.int16, 100000 times           0.362891579978168
device: cpu, dtype: torch.int32, 100000 times           0.37680106051266193
device: cpu, dtype: torch.int64, 100000 times           0.3689364707097411
device: cuda, dtype: torch.int8, 100000 times           1.419940729625523
device: cuda, dtype: torch.uint8, 100000 times          1.4247053815051913
device: cuda, dtype: torch.int16, 100000 times          1.4191444097086787
device: cuda, dtype: torch.int32, 100000 times          1.4305962566286325
device: cuda, dtype: torch.int64, 100000 times          1.4567416654899716
__iand__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.06224383972585201
device: cpu, dtype: torch.uint8, 10000 times            0.06205617543309927
device: cpu, dtype: torch.int16, 10000 times            0.05016433447599411
device: cpu, dtype: torch.int32, 10000 times            0.05216377507895231
device: cpu, dtype: torch.int64, 10000 times            0.06139362137764692
device: cuda, dtype: torch.int8, 10000 times            0.14827249851077795
device: cuda, dtype: torch.uint8, 10000 times           0.14801877550780773
device: cuda, dtype: torch.int16, 10000 times           0.14952312968671322
device: cuda, dtype: torch.int32, 10000 times           0.14999118447303772
device: cuda, dtype: torch.int64, 10000 times           0.14951884001493454
```
After:
```
__and__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.23157884553074837
device: cpu, dtype: torch.uint8, 100000 times           0.23063660878688097
device: cpu, dtype: torch.int16, 100000 times           0.23005440644919872
device: cpu, dtype: torch.int32, 100000 times           0.23748818412423134
device: cpu, dtype: torch.int64, 100000 times           0.24106105230748653
device: cuda, dtype: torch.int8, 100000 times           1.4394256137311459
device: cuda, dtype: torch.uint8, 100000 times          1.4436759827658534
device: cuda, dtype: torch.int16, 100000 times          1.4631587155163288
device: cuda, dtype: torch.int32, 100000 times          1.459101552143693
device: cuda, dtype: torch.int64, 100000 times          1.4784048134461045
__and__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.028442862443625927
device: cpu, dtype: torch.uint8, 10000 times            0.028130197897553444
device: cpu, dtype: torch.int16, 10000 times            0.025318274274468422
device: cpu, dtype: torch.int32, 10000 times            0.02519288007169962
device: cpu, dtype: torch.int64, 10000 times            0.028299466706812382
device: cuda, dtype: torch.int8, 10000 times            0.14342594426125288
device: cuda, dtype: torch.uint8, 10000 times           0.145280827768147
device: cuda, dtype: torch.int16, 10000 times           0.14673697855323553
device: cuda, dtype: torch.int32, 10000 times           0.14499565307050943
device: cuda, dtype: torch.int64, 10000 times           0.14582364354282618
__iand__ (a.numel() == 10) for 100000 times
device: cpu, dtype: torch.int8, 100000 times            0.25548241566866636
device: cpu, dtype: torch.uint8, 100000 times           0.2552562616765499
device: cpu, dtype: torch.int16, 100000 times           0.25905191246420145
device: cpu, dtype: torch.int32, 100000 times           0.26635489892214537
device: cpu, dtype: torch.int64, 100000 times           0.26269810926169157
device: cuda, dtype: torch.int8, 100000 times           1.485458506271243
device: cuda, dtype: torch.uint8, 100000 times          1.4742380809038877
device: cuda, dtype: torch.int16, 100000 times          1.507783885113895
device: cuda, dtype: torch.int32, 100000 times          1.4926990242674947
device: cuda, dtype: torch.int64, 100000 times          1.519851053133607
__iand__ (a.numel() == 1000) for 10000 times
device: cpu, dtype: torch.int8, 10000 times             0.03425929415971041
device: cpu, dtype: torch.uint8, 10000 times            0.03293587639927864
device: cpu, dtype: torch.int16, 10000 times            0.029559112153947353
device: cpu, dtype: torch.int32, 10000 times            0.030915481969714165
device: cpu, dtype: torch.int64, 10000 times            0.03292469773441553
device: cuda, dtype: torch.int8, 10000 times            0.15792148280888796
device: cuda, dtype: torch.uint8, 10000 times           0.16000914946198463
device: cuda, dtype: torch.int16, 10000 times           0.1600684942677617
device: cuda, dtype: torch.int32, 10000 times           0.16162546630948782
device: cuda, dtype: torch.int64, 10000 times           0.1629159888252616
```
Fix https://github.com/pytorch/pytorch/issues/24508, https://github.com/pytorch/pytorch/issues/24509, https://github.com/pytorch/pytorch/issues/24655, https://github.com/pytorch/pytorch/issues/24656.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31104

Differential Revision: D18938930

Pulled By: VitalyFedyunin

fbshipit-source-id: a77e805a0b84e8ace16c6e648c2f67dad44f2e44
2020-01-03 10:32:36 -08:00
68f3782106 remove std_single and var_single code in TH (#31608)
Summary:
std_single and var_single in TH are never used; remove them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31608

Differential Revision: D19270920

Pulled By: VitalyFedyunin

fbshipit-source-id: e106a42383bf224f7e2c1c092b95484d23af4b0a
2020-01-03 10:16:52 -08:00
0b9cd410a9 Fix cumsum error for tensors with zero elements (#31694)
Summary:
Currently `cumsum` crashes in backward for tensors with non-empty dimensions but zero elements, which happens when some dimension has size zero. This commit fixes the error by checking both `dim()` and `numel()` in cumsum backward.
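A minimal repro sketch of the failing case (shape chosen for illustration):

```python
import torch

# Non-empty dimensions (dim() == 2) but zero elements (numel() == 0).
x = torch.randn(0, 3, requires_grad=True)
y = x.cumsum(0)
y.sum().backward()   # previously crashed in cumsum backward
```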

Fixes https://github.com/pytorch/pytorch/issues/31515
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31694

Reviewed By: mrshenli

Differential Revision: D19266613

Pulled By: leedtan

fbshipit-source-id: 9407e0aa55440fed911c01a3580bb6c5eab62a16
2020-01-03 10:16:46 -08:00
daf00beaba Remove duplicated Numa detection code. (#30628)
Summary:
cmake/Dependencies.cmake (commit 1111a6b810, lines 595-609) has already detected NUMA. Duplicated detection and variables may lead to
incorrect results.

Close https://github.com/pytorch/pytorch/issues/29968
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30628

Differential Revision: D18782479

Pulled By: ezyang

fbshipit-source-id: f74441f03367f11af8fa59b92d656c6fa070fbd0
2020-01-03 08:48:46 -08:00
8c425dd201 Fix race condition when creating build dir (#30956)
Summary:
The original `check-and-act` style can raise `FileExistsError` when multiple processes are jit-compiling the extension on the same node.
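A minimal sketch of the race and the standard fix (the directory path here is hypothetical):

```python
import os

build_dir = "/tmp/my_extension_build"   # hypothetical build directory

# The racy check-and-act pattern the PR removes:
#   if not os.path.exists(build_dir):
#       os.mkdir(build_dir)   # a second process can win the race, and
#                             # this call then raises FileExistsError
# The race-free form:
os.makedirs(build_dir, exist_ok=True)
```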
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30956

Differential Revision: D19262570

Pulled By: ezyang

fbshipit-source-id: bb18c72e42648770b47f9378ac7c3929c3c03efc
2020-01-03 07:58:26 -08:00
f56c59ead6 clarify when to use as_tuple in torch.nonzero
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31798
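For context, the behavior being documented (a quick sketch):

```python
import torch

x = torch.tensor([[0, 1], [2, 0]])
torch.nonzero(x)                        # 2-D tensor of index rows: [[0, 1], [1, 0]]
idx = torch.nonzero(x, as_tuple=True)   # tuple of 1-D index tensors
x[idx]                                  # usable for advanced indexing: tensor([1, 2])
```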

Differential Revision: D19272332

Pulled By: zou3519

fbshipit-source-id: 954d086a7b9f1a719e0dac303a4253bf7ec8e9f4
2020-01-03 07:43:35 -08:00
95cb66570a Erase array sizes from types in c10::str(). (#31683)
Summary:
This dramatically reduces the number of instantiations and eliminates
~900KB of code from my local build of libtorch_cpu.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31683

Differential Revision: D19258364

Pulled By: resistor

fbshipit-source-id: addb921a26289978ffd14c203325ca7e35a4515b
2020-01-02 22:30:57 -08:00
f39105b68f add num_pending_users to debug info (#31539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31539

Adding this metric primarily because it is needed to unblock unit
tests for https://github.com/pytorch/pytorch/pull/31381. It also may be useful
to look at this metric to see the number of pending RRef forks that currently
exist.
ghstack-source-id: 96230360

Test Plan: Modified the relevant unit test.

Differential Revision: D19204158

fbshipit-source-id: 016345e52cd02cc5f46837bffd8d589ba8575f29
2020-01-02 21:28:03 -08:00
5be8dac329 Remove non-ascii character from torch/onnx/symbolic_opset11.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31814

Reviewed By: houseroad

Differential Revision: D19270742

Pulled By: bddppq

fbshipit-source-id: 80800d588e63701d6e1b5838d7ada993f0246a81
2020-01-02 20:54:32 -08:00
fc598f9023 generate op dependency graph as python code
Summary:
Add support to print op dependence as python code so that both custom
build script and BUCK can import it without yaml parser.

Test Plan:
- generate the file:
```
ANALYZE_TORCH=1 FORMAT=py DEPLOY=1 tools/code_analyzer/build.sh -closure=false
```

- load the file in python:
```
python
>>> from tools.code_analyzer.generated.torch import TORCH_DEPS
>>> print(TORCH_DEPS)
```

Differential Revision: D18894639

Pulled By: ljk53

fbshipit-source-id: e304d0525a07a13cf6e8a9317cd22637200d044c
2020-01-02 20:26:28 -08:00
fa0424f224 add LLVM-dev package to android docker image (#31215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31215

Install LLVM-dev package for code analysis CI job: #30937

LLVM-dev package is not related to android NDK but the whole code
analysis thing is for mobile custom build so choose this docker image.

Test Plan: - wait docker image to build?

Differential Revision: D19193223

Pulled By: ljk53

fbshipit-source-id: 54a79daf8d98fa7c8b9eed11f519e1c7b1614be8
2020-01-02 20:26:24 -08:00
dc43f9dc54 fix test_backward_node_failure flakiness (#31588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31588

Per title. This test can sometimes fail with a different error regex
than the one that is currently tested, so add this error regex to make the test
pass consistently.

Differential Revision: D19222275

fbshipit-source-id: 89c95276d4d9beccf9e0961f970493750d78a96b
2020-01-02 15:44:16 -08:00
155376721c Pin hypothesis package to 4.57.1 to avoid test failures
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31794

Test Plan: Imported from OSS

Differential Revision: D19266039

Pulled By: mrshenli

fbshipit-source-id: 4b1839c4de2b4476c8173a79582c861bf4fa998f
2020-01-02 15:33:03 -08:00
5f8308e32d Pin Pillow to v6 as PILLOW_VERSION is removed in v7
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31777

Test Plan: Imported from OSS

Differential Revision: D19264247

Pulled By: mrshenli

fbshipit-source-id: 52b0a3629e3a96ef2f9d3e289b9f7bb6a2745786
2020-01-02 15:32:58 -08:00
feb0ccdbfd Updating submodules
Summary:
GitHub commits:

123ae291fc
b9e9d4f7d9
86ea03e727
1cd1bfb668
917504ac42
06cc652030
e63819cbe3
6d21d8cfd3
b636829d55
19d0faece2
9860344e10

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 1de7509af788dc7861cfc779936fbc9e0146a5a5
2020-01-02 14:35:41 -08:00
ed5cd0d742 Use numeric limits to define TensorTypeSet(FULL) representation (#31668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31668

This also removes an annoying warning about change of sign conversion

Test Plan: Run unit tests

Reviewed By: ezyang

Differential Revision: D19238631

fbshipit-source-id: 29b50abac635e530d5b0453c3a0f36a4573fbf5b
2020-01-02 12:54:02 -08:00
d770fbc1d2 Some modifications to improve readability (#31352)
Summary:
For a long string, it is good to give it a name (i.e., assign it to a variable).

When building a dict, a literal is more readable, and faster, than the dict constructor.
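A small sketch of the dict claim (timings vary by machine, but the literal consistently wins):

```python
import timeit

# dict() pays a global name lookup plus a function call;
# the literal is compiled directly into dict construction.
print(timeit.timeit("dict(alpha=1, beta=2)", number=1_000_000))
print(timeit.timeit("{'alpha': 1, 'beta': 2}", number=1_000_000))
```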

I always appreciate your efforts in creating the world's best frameworks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31352

Differential Revision: D19191967

Pulled By: ngimel

fbshipit-source-id: 21f063b163b67de8cf9761a4db5991f74318e991
2020-01-02 12:48:34 -08:00
7078f4b27d skip _test_optional_float in BC check (#31786)
Summary:
Skip _test_optional_float
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31786

Reviewed By: hl475

Differential Revision: D19265059

Pulled By: houseroad

fbshipit-source-id: 6b95bd3b8cad83a4c459c0603befaaeeade6cdff
2020-01-02 11:12:38 -08:00
37fc59e847 Updating submodules
Summary:
GitHub commits:

17caab3d7b

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: f4828cd5c81615d0df86f915b3abb6a58509aa79
2020-01-02 10:57:58 -08:00
9e9bfbfd8d Update old scheduler example usage (#31358)
Summary:
Update the old example usage in CosineAnnealingWarm: `scheduler.step()` should be called after `optimizer.step()`. A sketch of the corrected ordering is below.

https://github.com/pytorch/pytorch/issues/20028#issuecomment-566061580
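A minimal sketch of the corrected ordering, assuming `CosineAnnealingWarmRestarts` is the scheduler the example refers to:

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

for epoch in range(20):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()   # called after optimizer.step(), per the updated docs
```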
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31358

Differential Revision: D19199311

Pulled By: vincentqb

fbshipit-source-id: cb29b95f8277d2dfa75ec2a83c1af03a5c9c9a69
2020-01-02 09:15:04 -08:00
c4f10e0fe7 Renaming scales parameter for interpolate (#31526)
Summary:
PR separated from https://github.com/pytorch/pytorch/pull/31274.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31526

Reviewed By: zou3519

Differential Revision: D19221931

Pulled By: gchanan

fbshipit-source-id: 81958a9910867ac9d62f2b47abc49384526c4e51
2020-01-02 08:19:30 -08:00
236b0a318c Delete ATen/stub (#31763)
Summary:
This folder contained an empty CombinedStub file which isn't explicitly used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31763

Differential Revision: D19262563

Pulled By: ezyang

fbshipit-source-id: 5d095c93d6f7a1cc35f5919aa6006b31c2376b18
2020-01-02 07:04:07 -08:00
cb1af5f61f Revert D19233558: add float[] str[] constants
Test Plan: revert-hammer

Differential Revision:
D19233558

Original commit changeset: 4f7c6d9ddbe7

fbshipit-source-id: a5020a9169e349a5970323471d673e8cd7818c66
2019-12-31 11:57:34 -08:00
7a3ed36309 Fix nvcc math functions for MSVC 2019 (#31704)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31108.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31704

Differential Revision: D19256110

Pulled By: mingbowan

fbshipit-source-id: a4aba2830aba002497f70a75ef995e5e7de08393
2019-12-31 10:52:12 -08:00
1499b894c4 Apply clang-format to csrc/distributed/rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31681

Test Plan: Imported from OSS

Differential Revision: D19247085

Pulled By: mrshenli

fbshipit-source-id: ce6c1710663eecda3641d8dcf80ef16f9d21b93e
2019-12-31 07:25:50 -08:00
b102550d2c Allow to pass in masks through db (#31676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31676

Facebook:

Previously we assumed the mask is passed in as a tensor, which is not feasible for sparse parameters.
Here we allow passing in the mask through a db path, which requires the masks to be stored in some db first.

Test Plan: unit tests

Reviewed By: ellie-wen

Differential Revision: D18928753

fbshipit-source-id: 75ca894de0f0dcd64ce17b13652484b3550cbdac
2019-12-30 20:54:27 -08:00
39297bfe08 Fix flaky test_debug_info. (#31675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31675

This test could be flaky since there could be in-flight RPC requests from
startup that might not have finished. If they finish between the different
calls that retrieve debug_info, we would report inconsistent information.
To avoid this flakiness, we wait for the metrics to stabilize.
ghstack-source-id: 96188488

Test Plan: waitforbuildbot

Differential Revision: D19242588

fbshipit-source-id: 8f3db7e7365acbd3742e6ec0c2ddcca68f27db9e
2019-12-30 18:07:26 -08:00
f4e955ff62 Change PackSegments to ensure consistent behavior between CPU and GPU
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31673

Reviewed By: Wakeupbuddy, BIT-silence

Differential Revision: D18925762

fbshipit-source-id: e0c318e97f69b14a54f43c176af57d98fbc16c9f
2019-12-30 13:31:45 -08:00
dd0f2f0c19 add float[] str[] constants (#31503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31503

Add support for float lists and string lists constants, which enables better constant propagation + constant pooling + freezing.

Test Plan: Imported from OSS

Differential Revision: D19233558

Pulled By: eellison

fbshipit-source-id: 4f7c6d9ddbe7623757a9a20606ce5f394e14e93d
2019-12-30 11:58:17 -08:00
6064223808 @slowTest some slow tests (#31706)
Summary:
These are all the jit tests that take > 10 seconds according to `pytest test/test_jit.py --durations=15`

```
32.76s call     test/test_jit.py::TestModels::test_super_resolution
32.20s call     test/test_jit.py::TestModels::test_neural_style
30.90s call     test/test_jit.py::TestJit::test_export_batchnorm
25.95s call     test/test_jit.py::TestJit::test_dropout_module_requires_grad
22.24s call     test/test_jit.py::TestJitGeneratedModule::test_nn_Transformer
12.38s call     test/test_jit.py::TestScript::test_fuser_double_float_codegen
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31706

Pulled By: driazati

Differential Revision: D19251567

fbshipit-source-id: 8e76f717506b8bf28d1a63ce302feb0446dc9141
2019-12-30 11:45:24 -08:00
ee87b01f40 add additional types to indexing operations dispatch (#31692)
Summary:
- Fixes https://github.com/pytorch/pytorch/issues/31672
- Adds Bfloat16 dispatch to the indexing operations that were missing it
    - index_put on cuda does not have bfloat16 dispatch, because I'm not sure bfloat16 math ops work on cuda

Note: `index_put_` with `accum=True` is enabled for `bool`, which does not make much sense, but I'm not the one who started it, so this behavior is preserved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31692

Differential Revision: D19249561

Pulled By: ngimel

fbshipit-source-id: 1269196194f7b9f611b32be198c001704731a78f
2019-12-29 23:03:54 -08:00
22d84204f7 Expose torch.poisson in documentation (#31667)
Summary:
Changelog:
- Add doc string for torch.poisson describing the current behavior
- Check for non-positive entries in the tensor passed as input to torch.poisson (usage sketch below)
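A quick usage sketch of the exposed function:

```python
import torch

rates = torch.tensor([1.0, 4.0, 10.0])  # per-element Poisson rates
samples = torch.poisson(rates)          # one integer-valued draw per rate
```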

Closes https://github.com/pytorch/pytorch/issues/31646
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31667

Differential Revision: D19247371

Pulled By: ngimel

fbshipit-source-id: b53d105e73bf59a45beeb566f47365c3eb74efca
2019-12-28 21:32:26 -08:00
3b7916fccd Modify the order of arguments position of torch.std and torch.std_mean in doc (#31677)
Summary:
Change log:

- [x] Change the order of arguments position of torch.std and torch.std_mean in doc.
- [x] Correct a spelling mistake of torch.std_mean in doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31677

Differential Revision: D19247372

Pulled By: ngimel

fbshipit-source-id: 8685f5207c39be524cdc81250430beac9d75f330
2019-12-28 20:36:26 -08:00
e8e47c0a1b Split RRef class into abstract RRef and RRefBase (#28942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28942

The new abstract RRef class contains only user-facing RRef APIs.
It will be later moved to a common folder so that it can be shared
by jit and distributed packages to provide TorchScript support.

Test Plan: Imported from OSS

Differential Revision: D18240590

Pulled By: mrshenli

fbshipit-source-id: ac28cfc2c8039ab7131b537b2971ed4738710acb
2019-12-28 20:01:02 -08:00
90a187618e Integrate masked sparse Adagrad (#31641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31641

Assuming mask is provided as a tensor

Test Plan: unit test

Reviewed By: ellie-wen

Differential Revision: D18928737

fbshipit-source-id: a4f3dd51769c2b56e5890043e91c18e6128be082
2019-12-27 18:40:50 -08:00
ae214f67a5 updated code to ensure error check for negative dims
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31636

Differential Revision: D19233031

Pulled By: anjali411

fbshipit-source-id: c29265ddd1f887f1a0b98aca56a2691d7584353d
2019-12-27 14:39:57 -08:00
647569e546 get rid of choco install (#30897)
Summary:
7zip and cmake are part of the base image, so there is no need to re-install them. Removing the install step makes build/test more stable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30897

Differential Revision: D19232961

Pulled By: mingbowan

fbshipit-source-id: fa3bbd1325839a2a977bf13fdbd97fda43793b8d
2019-12-27 13:12:04 -08:00
35bee0c729 separate op for rowwise counter (#31612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31612

Count the number of recent updates on rows. Exponential decay is applied to the counter with decay rate r, such that
    r^{counter_halflife} = 0.5;
if counter_halflife is nonpositive, this operator is turned off.
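As a sanity check on the stated relation, a small Python sketch (variable names are illustrative, not the operator's actual arguments):

```python
# r ** counter_halflife == 0.5  implies  r == 0.5 ** (1 / counter_halflife)
counter_halflife = 1000
r = 0.5 ** (1.0 / counter_halflife)

counter = 1.0
for _ in range(counter_halflife):
    counter *= r                      # one decay per update step
assert abs(counter - 0.5) < 1e-9     # halved after counter_halflife steps
```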

Test Plan: added unittest

Reviewed By: chocjy

Differential Revision: D19217921

fbshipit-source-id: 96d850123e339212cc0e0ef352ea8a1b1bf61dfa
2019-12-27 12:18:39 -08:00
e84e7ec556 Kill aten_custom_call.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25613

Test Plan: Imported from OSS

Differential Revision: D17172503

Pulled By: gchanan

fbshipit-source-id: 1456ecca8f459d008e335412cd7084bdfcb93439
2019-12-27 11:08:42 -08:00
b522a8e1ff Optimize zero length input (#31602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31602

Pull Request resolved: https://github.com/pytorch/glow/pull/3943

Zero-length input is something we hit fairly frequently in practice. The previous handling via the global TensorPool involved two locks per input (acquire and reclaim). Here we use a specialized anchor tensor to host zero-length input. Note that it is only padded to the max sequence length. If necessary, an easy extension can be added to pad to the max `InputPlaceholder.getType().size()`.

Reviewed By: jfix71

Differential Revision: D19192467

fbshipit-source-id: cafdc1eb7bf9b9d6ead04a0243b0be838f6b71cd
2019-12-26 22:31:15 -08:00
204939b401 Automatic update of fbcode/onnx to 57ebc587fcf3913b4be93653b0dd58c686447298 (#31642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31642

Previous import was c08a7b76cf7c1555ae37186f12be4d62b2c39b3b

Included changes:
- **[57ebc587](https://github.com/onnx/onnx/commit/57ebc587)**: python_out does not recognize dllexport_decl. (#2482) <xkszltl>
- **[477a9b87](https://github.com/onnx/onnx/commit/477a9b87)**: Edited PythonAPIOverview.md (#2491) <AlexMuresan>
- **[59b9f908](https://github.com/onnx/onnx/commit/59b9f908)**: Minor correction type (#2411) <Jhuo IH>
- **[cdc8b861](https://github.com/onnx/onnx/commit/cdc8b861)**: fix the optimize pass of fuse_consecutive_transposes (#2471) <XavierAtShanghai>
- **[ad1f5567](https://github.com/onnx/onnx/commit/ad1f5567)**: Add clarification for bias quantization in QlinearConv Op spec (#2464) <Ashwini Khade>
- **[d9a73ccc](https://github.com/onnx/onnx/commit/d9a73ccc)**: Add remove operator and function requirements to the add new op doc. (#2486) <Emad Barsoum>

Test Plan: cont build

Reviewed By: hl475

Differential Revision: D19234753

fbshipit-source-id: 4b7de1407d9b64e584f6e6d68cbe03fa1b4c854d
2019-12-26 21:25:04 -08:00
ffcac9ad37 Clean White List for BC Checks (#31629)
Summary:
Delete obsolete items
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31629

Reviewed By: hl475

Differential Revision: D19231522

Pulled By: houseroad

fbshipit-source-id: 393ed630f7854b643c8fa8c5f3f576718934de96
2019-12-26 21:21:39 -08:00
4983ef8de1 Integrating MaskedAdagrad
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31640

Test Plan: unit test

Reviewed By: ellie-wen

Differential Revision: D18805278

fbshipit-source-id: 1def4a89b7e4e04385c762bf127d95c5e513180e
2019-12-26 17:18:39 -08:00
Jie
909b8eba0d cudnn grouped convolution nhwc patch (#31444)
Summary:
Earlier cuDNN versions don't support grouped convolution in NHWC well; a
legitimate configuration might still return CUDNN_STATUS_NOT_SUPPORTED in
later cuDNN versions. We fall back to NCHW when the runtime check shows a
cuDNN version < 7.6.0, to keep the logic simple.

Note:
We might update the heuristics; 7.6.0 is very conservative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31444

Differential Revision: D19232414

Pulled By: VitalyFedyunin

fbshipit-source-id: 4c2d79ed347c49cd388bbe5b2684dbfa233eb2a3
2019-12-26 17:16:02 -08:00
39508501a4 Create byte-aware word lstm benchmark (#31260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31260

1. Update the LiteLM dataset conversion script (fbcode/pytext/fb/tools/lite_lm_dataset_to_tensorproto.py)
2. Created a benchmark json file for byte-aware lstm word model (xplat/aibench/specifications/models/caffe2/assistant/lite_lm_len5.json)
3. In order to run the model -- created an int64 Tensor for the model, added batch gather ops to the BUCK file

Test Plan:
```
1. Create tensorproto of the model input
buck run mode/opt //pytext/fb/tools:byte_lm_dataset_to_tensorproto -- --in-path /mnt/vol/pytext/smart_keyboard/aibench/test_5.txt --out-path /mnt/vol/pytext/smart_keyboard/aibench/byteAwareWordLM/ --hidden_dim 203 --layers_num 2 --max_seq_len 64 --max_byte_len 15

2. Run the aibench command
buck run fbsource//xplat/aibench:run_bench -- -b aibench/specifications/models/caffe2/assistant/lm_byte_lstm_len5.json --remote --devices SM-G960U-8.0.0-26
```

Reviewed By: gardenia22

Differential Revision: D17785682

fbshipit-source-id: 351c3c8bae16449e72ac641522803b23a83349be
2019-12-26 16:44:30 -08:00
91eb7c26cd Fix Typos
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31630

Differential Revision: D19233162

Pulled By: zou3519

fbshipit-source-id: c2716a2df2b2ccfeda7718b484e9605515ecdf01
2019-12-26 15:47:10 -08:00
34dce8e348 Updating submodules
Summary:
GitHub commits:

a40d608341
50e0ea13e5
bcbdec74f4

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 3de13d5b9b20ec18927ee3f0224df789172a3e9c
2019-12-26 15:06:04 -08:00
ec4e347744 Add Python language reference docs (#30686)
Summary:
This exposes our audit of https://docs.python.org/3/reference/ with descriptions for each line item.

To generate the `.rst` from the Quip:

```bash
pip install m2r
m2r jit_language_reference.md
```

https://driazati.github.io/pytorch_doc_previews/30686/jit.html#python-functions-and-modules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30686

Pulled By: driazati

Differential Revision: D19219587

fbshipit-source-id: 249db9b5ee20e38804d4302bbfeca7d54f27d0bd
2019-12-26 13:21:36 -08:00
5d95a9ca79 Print all broken ops instead of the first one (#31628)
Summary:
Originally, we printed only the first broken schema. With this changeset, all broken schemas are printed out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31628

Reviewed By: hl475

Differential Revision: D19231444

Pulled By: houseroad

fbshipit-source-id: 3dd5b4609a6a9a9046e95f2f30deb9beeb5dcd56
2019-12-26 12:51:43 -08:00
cf46bcace8 Updating submodules
Summary:
GitHub commits:

faebc336da
23d8703808

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 0368879112c318607821bbf3a081669dade19148
2019-12-26 12:27:04 -08:00
866c1b1fcc Ensure legacy sparse constructor/new doesn't interpret python data as tensor data. (#31490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31490

When this happens, a dense tensor is constructed from a sparse constructor.

Fixes: https://github.com/pytorch/pytorch/issues/16154

Test Plan: Imported from OSS

Reviewed By: cpuhrsch, mrshenli

Differential Revision: D19196498

Pulled By: gchanan

fbshipit-source-id: 57a6324833e35f3e62318587ac74267077675b93
2019-12-26 10:46:18 -08:00
e2951d586d Updating submodules
Summary:
GitHub commits:

11a904583d

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: f00bf65aebddb4541faa2626d42ac436e090ee89
2019-12-26 09:49:33 -08:00
29f345831e Error out if legacy Tensor.new is called on alternate layouts / dtypes (#31485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31485

Fixes: https://github.com/pytorch/pytorch/issues/22158

Test Plan: Imported from OSS

Differential Revision: D19196499

Pulled By: gchanan

fbshipit-source-id: a01ea7641b5fcd00a9d267243539ff64a5492e5f
2019-12-26 07:27:24 -08:00
a54dc87e8e revert D18805532 and make numerics of masked adagrad consistent with unmasked adagrad (#30784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30784

Instead of putting the experimental Masked*Adagrad into OSS, we decided to change D18805278 .

Test Plan: CI

Reviewed By: chocjy

Differential Revision: D18824265

fbshipit-source-id: 3d893fe6c441f2ff7af4c497cf81b9c49363e7a8
2019-12-24 10:02:13 -08:00
363d8be787 Bypass _TorchScriptTesting_StackString::pop in BC check now (#31586)
Summary:
Failed result: https://circleci.com/gh/pytorch/pytorch/4054919?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link/console

Original PR: https://github.com/pytorch/pytorch/pull/30242
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31586

Reviewed By: hl475

Differential Revision: D19222086

Pulled By: houseroad

fbshipit-source-id: 96db2bf18fa06eaebdd558e86615e26b95f34516
2019-12-23 22:00:20 -08:00
46ad80c839 Fix null pointer dereference on Android for strtod_c (#31582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31582

D19124934 removed a dummy pointer passed to strtod_c() that's used only on Android (https://fburl.com/diffusion/zkv34jf1). Without it, jit parsing on Android started throwing SIGSEGV due to a null pointer dereference. This diff adds the dummy pointer back.

Test Plan: Tests

Reviewed By: driazati, shoumikhin

Differential Revision: D19221071

fbshipit-source-id: 2e230c3fbfa873c3f7b92f73c87ee766ac182115
2019-12-23 20:08:13 -08:00
446e9af5b9 Fix parsing of big float literals (#29940)
Summary:
Stacked PRs
 * **#29940 - [jit] Fix parsing of big float literals**
 * #29935 - [jit] Fix hex literal parsing
 * #29931 - [jit] Throw a better error for int too big for int64_t
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29940

Pulled By: driazati

Differential Revision: D19186604

fbshipit-source-id: 6ef66588a5cf956f281e7bd1e5584ef06f5296e9
2019-12-23 17:21:07 -08:00
218cfd568d Conv transpose/backward split 32bit (#31510)
Summary:
Basically the same as https://github.com/pytorch/pytorch/pull/31379, except that I wrote a separate function `split_batch_dim_to_32bit_out` for the logic. This function could also be used for the convolution forward; I will rebase this PR after https://github.com/pytorch/pytorch/issues/31379 gets merged and then change `raw_cudnn_convolution_forward_out` to use `split_batch_dim_to_32bit_out`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31510

Differential Revision: D19210563

Pulled By: ngimel

fbshipit-source-id: e20bb82b6360aa2c0e449e127188c93f44e1e9b4
2019-12-23 11:34:17 -08:00
fb63c0e2c9 Remove -Wno-unused-private-field
Test Plan: Sanity check

Reviewed By: nlutsenko

Differential Revision: D18833450

fbshipit-source-id: c69b6679b4caa3e868ca41113cd502c8905a776b
2019-12-23 10:59:00 -08:00
68e5172382 Support optional float parameters (float?, optional<double>). (#31517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31517

This is going to be used by upsample (which currently uses magic values to represent optionals).

For now, we just introduce a fake function for testing (torch._test_optional_float(x)).

Test Plan: Imported from OSS

Differential Revision: D19198721

Pulled By: gchanan

fbshipit-source-id: 0a1382fde0927c5d277d02d62bfb31fb574b8c74
2019-12-23 08:33:39 -08:00
9459db86bf Raise warning for schedulers following chainable schedulers (#31125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29697.

Raise warning for schedulers following chainable schedulers in https://github.com/pytorch/pytorch/issues/26423. See explanation for
* [new warning when load/save](https://github.com/pytorch/pytorch/issues/29697#issuecomment-564655802)
* [change from deprecation to user warning](https://github.com/pytorch/pytorch/issues/29697#issuecomment-564659775).

gchanan -- This should go in the upcoming release following https://github.com/pytorch/pytorch/issues/26423.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31125

Differential Revision: D19143740

Pulled By: vincentqb

fbshipit-source-id: 35b55fe6c5b39ca5a68b1a6e19f14eb95b9a784e
2019-12-23 08:24:22 -08:00
fe76af96ed fix test_process_group_debug_info flaky test (#31533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31533

Fixes this test that was flaky and has been disabled (see
https://github.com/pytorch/pytorch/issues/31112)
ghstack-source-id: 96038999

Test Plan: Run the test 1000 times and ensure that it passes.

Differential Revision: D19203366

fbshipit-source-id: 7978cbb8ca0989a0a370a36349cdd4db3bb8345b
2019-12-22 18:01:21 -08:00
cc2d5ca37f add enabled API to autograd profiler (#31380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31380

To be able to profile async RPCs, we attach a `RecordFunction` object to the future that is created during the RPC, persisting it across the lifetime of the RPC (this is implemented in the next PR). Since we'd only like to do this when profiling is enabled, this PR adds an `enabled` API to the autograd profiler.
ghstack-source-id: 96053933

Test Plan: Modified unit test.

Differential Revision: D19050391

fbshipit-source-id: aa382110e69d06b4a84c83b31d2bec2d8a81ba10
2019-12-22 16:24:59 -08:00
7d630278da Separate torchbind from Python (#30242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/29501

Currently blocked on schema serialization issue

Test Plan: Imported from OSS

Differential Revision: D18463063

Pulled By: jamesr66a

fbshipit-source-id: c12a1b644eb9bf04e68ff93cccf91d6cb3e75359
2019-12-21 22:52:40 -08:00
700109eb63 set stream everytime when we get a cuDNN handle (#31541)
Summary:
cudnn version of https://github.com/pytorch/pytorch/pull/31537

https://github.com/pytorch/pytorch/pull/31532 is a quick fix and this is a bigger change. This would deprecate https://github.com/pytorch/pytorch/pull/31532, but we could also merge https://github.com/pytorch/pytorch/pull/31532 first for a quick fix and then work on this later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31541

Differential Revision: D19206753

Pulled By: ngimel

fbshipit-source-id: 3352f923d13a9baf0971f64f8b7ce03e9a8b42b1
2019-12-20 21:34:40 -08:00
b5bbec7bad set stream everytime when we get a cuSparse handle (#31538)
Summary:
cuSparse version of https://github.com/pytorch/pytorch/pull/31537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31538

Differential Revision: D19206895

Pulled By: ngimel

fbshipit-source-id: a32c0bc310189a89a0098837438d62458b5c0a7c
2019-12-20 21:31:17 -08:00
8d8e82883e set stream everytime when we get a cuBlas handle (#31537)
Summary:
I don't see any reason not to do so: it is a common error for people to forget to set the stream, and there is no reason not to run on the current stream.

This is just for cuBLAS; cuSparse and cuDNN should be modified as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31537

Differential Revision: D19206908

Pulled By: ngimel

fbshipit-source-id: ba2b2b74e9847f0495c76dbc778751a9f23f8b36
2019-12-20 21:31:13 -08:00
0b0f90f53c Split on batch dimension when 32bit indexing not enough for convolution forward (#31379)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/22496

This is just a first step towards the support of 64bit convolution on CUDA. In the forward of convolution, if the total tensor size is larger than 2^31, then we split it on the batch dimension. I want to get some review feedback before moving forward for the same splitting approach for backward.

There are real-world use cases that even when N=1 the input is still larger than 2^31. For this case, the splitting would be complicated, so I am planning to modify `use_cudnn` to just dispatch to the slow fallback kernel in PyTorch in a later PR.

Update: `later PR` is https://github.com/pytorch/pytorch/pull/31383
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31379

Differential Revision: D19192018

Pulled By: ngimel

fbshipit-source-id: c26ecc56319ac67c4d5302ffed246b8d9b5eb972
2019-12-20 21:27:06 -08:00
3820d6f6b9 make gc script python2 compatible (#31536)
Summary:
Get rid of the f-string; somehow we still have Python 2 (example below).
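For reference, the kind of change involved (a minimal sketch):

```python
name, n = "docker", 3
# msg = f"{name}: {n}"            # f-strings are Python 3.6+ only
msg = "{}: {}".format(name, n)    # works on both Python 2 and 3
```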
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31536

Differential Revision: D19204187

Pulled By: mingbowan

fbshipit-source-id: da8e17e4dccdd6fd1b0e92eb4740f5a09a8a4209
2019-12-20 16:34:33 -08:00
c808eed04a Nightly dimension, input shape in gradle (#30195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30195

1. Added flavorDimensions 'build' local/nightly
to be able to test the latest nightlies

```
cls && gradle clean test_app:installMobNet2QuantNightlyDebug -PABI_FILTERS=x86 --refresh-dependencies && adb shell am start -n org.pytorch.testapp.mobNet2Quant/org.pytorch.testapp.MainActivity
```

 2. To be able to change all new model setup editing only `test_app/build.gradle`
 Inlined model asset file names to `build.gradle`

Extracted input tensor shape to `build.gradle` (BuildConfig)

Test Plan: Imported from OSS

Differential Revision: D18893394

Pulled By: IvanKobzarev

fbshipit-source-id: 1fae9989d6f4b02afb42f8e26d0f3261d7ca929b
2019-12-20 16:08:04 -08:00
3a19980b78 Tensor class created from java does not call native methods
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31520

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D19199477

Pulled By: IvanKobzarev

fbshipit-source-id: ba51454586a9385dba4ab73936f907346e0105d1
2019-12-20 14:40:54 -08:00
11854bcd38 Add test to torch.jit.export_opnames, make the _C function private
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31446

Test Plan: Imported from OSS

Differential Revision: D19172851

Pulled By: iseeyuan

fbshipit-source-id: f06d8766ed73c9abe4ebf41c402ee64880d745be
2019-12-20 13:38:43 -08:00
81329c907d Updating submodules
Summary:
GitHub commits:

cbce6d17bb
4762e080cf
174107c0a4
8dee0e0058
ce52b27b4d
f89dea4fec
b269fc595c
5b014c641e
ae2d7e11a2

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 252ea5198c3fe4ecfe24e878ea701c48c57618de
2019-12-20 13:35:02 -08:00
35b249769d Exclude lite interpreter Java files from OSS host build
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31204

Test Plan: Imported from OSS

Differential Revision: D19200610

Pulled By: dreiss

fbshipit-source-id: 0cf41c99b4c2604afc2dccfebbea213c0e1f9638
2019-12-20 13:32:27 -08:00
08de70cad1 Remove observers in the end (#31407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31407

Remove observers at the end instead of before quantizing tensors,
since we still need them to find the quantization parameters for each module instance

Test Plan:
.

Imported from OSS

Differential Revision: D19162367

fbshipit-source-id: f817af87183f6c42dc97becea85ddeb7e050e2b1
2019-12-20 13:17:26 -08:00
b4c48b7e29 Call getQSchemeAndQParamMap later in quantizeTensors (#31406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31406

Previously we recorded quantization parameters for a given value when we collected the observer nodes,
but the quantization parameters can actually vary per module instance. To achieve
that, we need to delay the call to a later stage and only record the `Value*` that's needed
in the `collectObserverNodesAndValueToQuantize` function

Test Plan:
.

Imported from OSS

Differential Revision: D19162369

fbshipit-source-id: e0f97e322d18a281bf15b6c7bbb04c3dfacb512f
2019-12-20 13:17:21 -08:00
df9d5b8a77 Use macros instead of directly accessing Python object fields (#31388)
Summary:
The Python C API documentation states "Access to the [PyObject]
members must be done by using the macros Py_REFCNT and Py_TYPE."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31388

Differential Revision: D19161790

Pulled By: colesbury

fbshipit-source-id: ac9a3738c913ad290a6d3460d0d657ec5c13b711
2019-12-20 12:11:17 -08:00
5375ceae80 run optimizations on pre-profiled graph (#31392)
Summary:
This is the first stab at running profile-insensitive optimizations on pre-profiled graphs. Running those optimizations has the potential to simplify graphs greatly before GuardElimination, so GuardElimination should be able to remove more guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31392

Differential Revision: D19173639

Pulled By: Krovatkin

fbshipit-source-id: 2485a2a598c10f9b5445efb30b16439ad4551b3f
2019-12-20 10:49:08 -08:00
256db1e61b Add fake parsing for torchbind classes in schema type parser
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31506

Test Plan: Imported from OSS

Differential Revision: D19187722

Pulled By: jamesr66a

fbshipit-source-id: 4529409454d64393a821b8fa795db39bc82da8fc
2019-12-20 10:28:57 -08:00
7a12ccd003 optimize FloatToFused8BitRowwiseQuantized and Fused8BitRowwiseQuantizedToFloat (#31470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31470

Optimize the performance of these two operators.
Additionally, use nearbyint instead of round to be consistent with 4-bit embedding table quantization.
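For context, the difference between the two rounding modes can be seen with NumPy (an illustrative sketch, not the operator code): `np.rint` rounds half to even like `nearbyint`, while C's `round` rounds half away from zero.

```
import numpy as np

vals = np.array([0.5, 1.5, 2.5, -1.5])
# nearbyint-style rounding (half to even):
print(np.rint(vals))                                 # [ 0.  2.  2. -2.]
# round-style rounding (half away from zero):
print(np.sign(vals) * np.floor(np.abs(vals) + 0.5))  # [ 1.  2.  3. -2.]
```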

Reviewed By: hyuen

Differential Revision: D19072103

fbshipit-source-id: efe96f14aeff7958cceb453ed625d3fd693891ff
2019-12-20 10:09:26 -08:00
0b57b383b1 Im2col export (#30972)
Summary:
Added im2col to opset 11.
This symbolic is used to export torch.nn.Unfold
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30972

Reviewed By: hl475

Differential Revision: D18946921

Pulled By: houseroad

fbshipit-source-id: 13dd0cbae899700df32fd74d6dff1f29033a2b4c
2019-12-20 09:45:45 -08:00
6cd987e7c0 Make fully_qualified_type_name_impl() compatible with VS2017 15.9 (#31455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31455

In 15.9, `__FUNCSIG__` resolves `using` type aliases and also preserves `noexcept` qualifiers

Test Plan: Build caffe2 on Windows using VS2017

Differential Revision: D19166204

fbshipit-source-id: b6c5f70e5262d13adf585f77b92223cf5f1e78dd
2019-12-20 09:17:44 -08:00
2099cfa13d Fix input_channels divisibility check in concat_split_op (#31448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31448

Replace `(!x%y)` with `(x%y != 0)`: `!` binds tighter than `%`, so `!x%y` parses as `(!x) % y` rather than the intended divisibility check

Test Plan: CI

Reviewed By: orionr

Differential Revision: D19165492

fbshipit-source-id: 246635fb8ddd5823196bcef9d0e6cdf1c349015e
2019-12-20 09:12:54 -08:00
b38901aa15 Test reading __cuda_array_interface__ inferred strides. (#31451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31451

The PR that fixed this, https://github.com/pytorch/pytorch/pull/24947, didn't add a test.

Fixes: https://github.com/pytorch/pytorch/issues/31443
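A minimal sketch of what the new test exercises (assuming a CUDA device is available):

```
import torch

if torch.cuda.is_available():
    t = torch.arange(6, device='cuda').reshape(2, 3)
    iface = t.__cuda_array_interface__
    # For C-contiguous tensors the interface reports strides as None,
    # so consumers must infer them from the shape.
    print(iface['shape'], iface['strides'])  # (2, 3) None
```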

Test Plan: Imported from OSS

Differential Revision: D19170020

Pulled By: gchanan

fbshipit-source-id: bdbf09989ac8a61b1b70bb1ddee103caa8ef435b
2019-12-20 08:21:39 -08:00
d0d6e0b5e3 add type promotion support for sparse tensors (#30429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30429

also fix a bug in uncoalesced division

General approach here is that we:
* compute the common dtype based on input tensors
* error if the output tensor is specified and the common type can't be cast back to the output type (e.g. for inplace ops)
* convert input tensor (values) to the common dtype
* perform the op as normal (computing at the common dtype instead of the result type).
* convert/copy the result values back to that of the result tensor (for in-place ops).

For uncoalesced division we need to coalesce first, because an integral tensor with values=[1,1] at the same index divided by 2 would give 1/2 + 1/2 = 0 instead of 2/2 = 1.
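A minimal sketch of that pitfall (illustrative values):

```
import torch

i = torch.tensor([[0, 0]])               # two entries at the same index
v = torch.tensor([1, 1])
s = torch.sparse_coo_tensor(i, v, (3,))  # uncoalesced; logical value is 2
# Dividing the stored values first would compute 1/2 + 1/2 = 0 for an
# integral tensor; coalescing first gives the correct 2/2 = 1.
print(s.coalesce().to_dense())           # tensor([2, 0, 0])
```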

Test Plan: Imported from OSS

Differential Revision: D19143223

Pulled By: nairbv

fbshipit-source-id: 480fa334c0b2b3df046818f2342cfd4e2d9d892a
2019-12-20 08:01:00 -08:00
e9ef087d2d Updating submodules
Summary:
GitHub commits:

357842e091
d62f47c763
dc94cd4972

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: dcb9813e1469cc867d9c826daa873c535ef408ab
2019-12-20 00:57:39 -08:00
4c341582ea modify model to enable loading by blob (#31507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31507

This script is used to generate a model with bound shape inference and
blob reorder, which are requirements for big model loading on T17.
1. Load existing model.
2. Do bound shape inference and blob reorder (put embedding blobs at the end).
3. Save the modified model.

Test Plan:
Generated a new model and tested on NNPI.
P124181047 (mismatch is AA variance)

Reviewed By: ipiszy

Differential Revision: D19165467

fbshipit-source-id: c3522fc5dc53b7ec652420558e9e8bf65a1ccfae
2019-12-19 21:57:22 -08:00
06dbef663d Add support for del (#31273)
Summary:
Adds the `del` keyword to the parser and a corresponding `aten::Delete` op for lists and dicts

Fixes #20615
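A minimal sketch of the new capability (hypothetical function written against the TorchScript frontend):

```
from typing import Dict

import torch

@torch.jit.script
def drop_key(d: Dict[str, int], key: str) -> Dict[str, int]:
    del d[key]  # now compiles via the new aten::Delete op
    return d

print(drop_key({"a": 1, "b": 2}, "a"))  # {'b': 2}
```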
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31273

Pulled By: driazati

Differential Revision: D19181473

fbshipit-source-id: c42a2d43ec361a98e0c425232981edc9c39388c4
2019-12-19 21:48:11 -08:00
624088e444 Don't dispatch to cudnn if it is not possible to make it 32bit by splitting batch dim (#31383)
Summary:
Also a step towards supporting 64bit indexing in convolution.

See also: https://github.com/pytorch/pytorch/pull/31379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31383

Differential Revision: D19183443

Pulled By: ngimel

fbshipit-source-id: 0c2030fac147e629d7be0c29f0683ec2b3f28c71
2019-12-19 18:00:03 -08:00
87768e5ade Updating submodules
Summary:
GitHub commits:

286867987e
09cbf47ea5
db100834c1
1ba92b8582
60240e3f08
beb5c4798e
c37eb5d377
1ada29037c
f12539bbc9

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 75b16ea1bc038b599b3540d0615dd9eb9ecfda74
2019-12-19 17:30:48 -08:00
457286a383 fix missing type check in dictionary literal
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31375

Test Plan: Imported from OSS

Differential Revision: D19145440

Pulled By: zdevito

fbshipit-source-id: 69909089586149ef766b4858d3420864a81b2493
2019-12-19 16:22:36 -08:00
348d42114e Kill MessageType::SHUTDOWN related logic in pg agent (#31270)
Summary:
https://github.com/pytorch/pytorch/pull/30330 got rid of the need to send a `MessageType::SHUTDOWN` message, so we can now remove the logic/utils for this type of message.

I think we can also delete the enum entry in the `enum MessageType`, but we may want to keep it in case the logic in https://github.com/pytorch/pytorch/pull/30710 is ever moved to C++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31270

Test Plan: All existing unit tests pass

Differential Revision: D19146983

Pulled By: rohan-varma

fbshipit-source-id: 35b185411f9446d7d4dfc37a6cb5477cf041e647
2019-12-19 13:47:43 -08:00
57caeb3fc1 Fix builtins table (#31492)
Summary:
Fixes a bad merge that is breaking distributed tests on master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31492

Pulled By: driazati

Differential Revision: D19180978

fbshipit-source-id: f69f525e2c7f61194686f07cf75db00eb642882f
2019-12-19 13:33:15 -08:00
226c2d79ce Get QScheme from observer module (#31293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31293

Previously we checked the number of elements in scale to determine if we are using per-channel quantization,
but we should get the qscheme information from the observer module directly; we'll expose this information
to the caller as well

Test Plan:
.

Imported from OSS

Differential Revision: D19146669

fbshipit-source-id: ea430eeae0ef8f441be39aa6dcc1bb530b065554
2019-12-19 13:33:11 -08:00
dbe2f265d0 Better error msg for autograd profiler + multi-worker dataloader crash (#31473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31473

Mitigates #6313

A common use case for the autograd profiler is to run it over an
entire model, including dataloading. The following will crash:
- run the autograd profiler in CUDA mode
- use a multi-worker DataLoader (presumably with the 'fork' start
method)
because the autograd profiler initializes CUDA, and forking after CUDA
is initialized is bad.

This PR puts in a nice error message when this happens so that users
aren't too confused. The new error message looks like:
https://gist.github.com/zou3519/903f15c3e86bad4585b7e5ce14cc1b70

Test Plan:
- Tested locally.
- I didn't add a test case for this because it's hard to write a test
case that doesn't completely stop the rest of our test suite from
running.

Differential Revision: D19178080

Pulled By: zou3519

fbshipit-source-id: c632525ba1f7b168324f1aa55416e5250f56a086
2019-12-19 13:30:19 -08:00
e67064a96f Exclude generated source docs from Google (#31484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31484

See https://github.com/pytorch/pytorch/issues/26123 for context.

Previously, when someone googles for `pytorch "adaptive_max_pool2d"`,
https://pytorch.org/docs/stable/_modules/torch/nn/modules/pooling.html
is the first result. This PR changes the docs build script to exclude
all such generated source docs under `_modules/` from Google.

It does this by doing a search for `<head>` and then appending
`<meta name="robots" content="noindex">`.
The [google developer
docs](https://support.google.com/webmasters/answer/93710?hl=en) suggest
that this is the right way to prevent google from indexing the page.

In the future, when the CI
builds documentation (both master and stable docs), the newly created
docs under _modules will have the meta noindex tag.

Test Plan:
- I ran `find "$install_path/_modules" -name "*.html" -print0 | xargs -0
sed -i '/<head>/a \ \ <meta name="robots" content="noindex">'` on a docs
build locally and checked that it does indeed append the meta noindex
tag after `<head>`.
- In a few days we should rerun the search to see if these pages are
still being indexed.

Differential Revision: D19180300

Pulled By: zou3519

fbshipit-source-id: 5f5aa95a85dd9f065607c2a16f4cdd24ed699a83
2019-12-19 13:27:12 -08:00
8f3c0d541e Speed up Tensor::has_names for unnamed tensors (#31436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31436

Tensor::has_names is slower than it should be for unnamed tensors
because of the following:
- it always tries to access the TLS for NamesMode. Unnamed tensors don't
need to peek at NamesMode to determine if they have names or not.
- There is some virtual function being called because TensorImpl is in
c10 and NamedTensorMeta is in libtorch.

This PR short-circuits Tensor::has_names for unnamed tensors by
checking whether the underlying TensorImpl holds a pointer to NamedTensorMeta.
If the NamedTensorMeta is nullptr, then the tensor is definitely
unnamed.

Benchmarks:
- I have a dedicated benchmarking machine where I isolate a single CPU
and make sure it runs at a fixed frequency.
- I benchmarked torch.add, which calls `tensor::has_names` three times.
- The TL;DR is that torch.add between size-1 unnamed tensors gets sped up
by ~200ns after this change, which is a 9% improvement.
- Before, on my machine:
https://gist.github.com/zou3519/dfd648a1941d584711d850754e0694bc
- After on my machine:
https://gist.github.com/zou3519/e78f0d8980b43d0d9c3e3e78ecd0d4d5

Test Plan: - run tests

Differential Revision: D19166510

Pulled By: zou3519

fbshipit-source-id: 1888a4e92d29152a5e3b778a95e531087e532f53
2019-12-19 13:19:30 -08:00
9d9bc93bfb Added error message to indicate that reduction operations are not supported for dim>=64 (#31476)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/23159
Currently we don't support reduction operations for dim >= 64, so we should give a descriptive RuntimeError indicating this.
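A minimal sketch of the behavior in question (assumed shapes and error type):

```
import torch

x = torch.zeros([1] * 65)  # 65-dimensional tensor
try:
    x.sum()
except RuntimeError as e:
    print(e)  # descriptive message that reductions support at most 64 dims
```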
Diff: D19179039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31476

Differential Revision: D19179039

Pulled By: anjali411

fbshipit-source-id: 58568f64627bf3df6b3e00a1498544c030e74a0e
2019-12-19 13:00:53 -08:00
779b128872 add back in reference to jit_unsupported section (#31486)
Summary:
It was added in https://github.com/pytorch/pytorch/pull/31329 and removed in a bad merge in https://github.com/pytorch/pytorch/pull/31138/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31486

Differential Revision: D19181967

Pulled By: eellison

fbshipit-source-id: 7e4b4a9b2042c30ec18f7f737bc4a9a56fac7d92
2019-12-19 12:44:16 -08:00
49fe7a7401 Updated documentation for NLLLoss to explain what x, y and w refer to (#31488)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/31385

In the current documentation for NLLLoss, it's unclear what `y` refers to in the math section of the loss description. An issue (https://github.com/pytorch/pytorch/issues/31295) was filed earlier expressing confusion about whether the loss returned for reduction=mean is correct, perhaps because the symbols in the formula are not clearly described in the current documentation.
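For reference, a small example mapping the symbols to code: `x` is the log-probability input, `y` the target class indices, and `w` the optional per-class weights (illustrative values):

```
import torch
import torch.nn as nn

w = torch.tensor([1.0, 2.0])                     # per-class weights
loss = nn.NLLLoss(weight=w, reduction='mean')
x = torch.log_softmax(torch.randn(3, 2), dim=1)  # log-probabilities
y = torch.tensor([0, 1, 1])                      # target class indices
# reduction='mean' divides by the sum of the selected weights w[y_n], not by N.
print(loss(x, y))
```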
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31488

Differential Revision: D19181391

Pulled By: anjali411

fbshipit-source-id: 8b75f97aef93c92c26ecbce55b3faf2cd01d3e74
2019-12-19 12:28:16 -08:00
d6acc87c93 Guard against copying from quantized Tensor to non-quantized Tensor (#29660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29660

att
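A minimal sketch of the newly guarded case (assumed error type):

```
import torch

q = torch.quantize_per_tensor(torch.randn(3), scale=0.1, zero_point=0,
                              dtype=torch.qint8)
dst = torch.empty(3)
try:
    dst.copy_(q)  # copying quantized -> non-quantized is now rejected
except RuntimeError as e:
    print(e)
```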

Test Plan:
python test/test_quantized_tensor.py

Imported from OSS

Differential Revision: D18799897

fbshipit-source-id: 5d1b4ef84f5ae8eba830784b74485d78fa1e6fcf
2019-12-19 12:16:44 -08:00
c4121ed8db Fix is_fundamental template for MSVC (#30959)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30932
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30959

Differential Revision: D18891797

Pulled By: mingbowan

fbshipit-source-id: e6c36ee80065e66117873e768f86f507c48aaef1
2019-12-19 12:10:22 -08:00
6d6a91fb0f Updating submodules
Summary:
GitHub commits:

58a1ec274c
24da1c8b66
77d5ba7887
c7b80d7ab5

Test Plan: n/a

Reviewed By: tgreenidge

fbshipit-source-id: be872df9014b795b279b93bd81efbaa41f2d0fd7
2019-12-19 12:05:29 -08:00
28376e826d Fix lint
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31463

Pulled By: driazati

Differential Revision: D19173580

fbshipit-source-id: 6e5bb24949ec357c4d5b29a16d1733b664f21e05
2019-12-19 10:17:01 -08:00
540b9da41e Bump numba version in circleCI config to 0.46.0. (#31435)
Summary:
The current numba version doesn't appear to actually work with our numba-cuda tests (numba.cuda.is_available() fails).

Previous attempts to upgrade were blocked by https://github.com/numba/numba/issues/4368.

It's a bit unclear to me, but I believe 0.46.0 fixes the above issue. I'm verifying that we catch that issue in CI via https://github.com/pytorch/pytorch/pull/31434.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31435

Differential Revision: D19166865

Pulled By: gchanan

fbshipit-source-id: e01fa48c577e35de178423db7a7f79ac3dd3894d
2019-12-19 07:55:55 -08:00
fc3103b116 fixing a naming issue in creating a residual loop node in a bailout graph (#31400)
Summary:
This addresses the issue of differentiating between the `%4` in
`%12 : int, %y.1 : Tensor = prim::Loop(%9, %6, %4, %3)` and the `%4` in `%y.5 : Double(3) = aten::cat(%22, %4) # test_jit.py:3772:24` inside the loop's body in a residual continuation loop, because these should be different values.

```
[DUMP profiling_graph_executor_impl.cpp:124] with prim::BailoutTemplate_0 = graph(%z.1 : int,
[DUMP profiling_graph_executor_impl.cpp:124]       %size.1 : int):
[DUMP profiling_graph_executor_impl.cpp:124]   %2 : Tensor = prim::Constant[value= 1  1 [ CPUDoubleType{2} ]]()
[DUMP profiling_graph_executor_impl.cpp:124]   %3 : Double(2) = prim::BailOut[index=0](%2, %z.1, %size.1)
[DUMP profiling_graph_executor_impl.cpp:124]   %4 : int = prim::Constant[value=0]() # test_jit.py:3772:54
[DUMP profiling_graph_executor_impl.cpp:124]   %5 : None = prim::Constant()
[DUMP profiling_graph_executor_impl.cpp:124]   %6 : bool = prim::Constant[value=1]() # test_jit.py:3770:16
[DUMP profiling_graph_executor_impl.cpp:124]   %counters.1 : int[] = prim::ListConstruct()
[DUMP profiling_graph_executor_impl.cpp:124]   %8 : int = prim::Constant[value=8]()
[DUMP profiling_graph_executor_impl.cpp:124]   %9 : int = aten::__round_to_zero_floordiv(%size.1, %8)
[DUMP profiling_graph_executor_impl.cpp:124]   %10 : int = aten::mul(%9, %8)
[DUMP profiling_graph_executor_impl.cpp:124]   %11 : int = aten::sub(%size.1, %10)
[DUMP profiling_graph_executor_impl.cpp:124]   %12 : int, %y.1 : Tensor = prim::Loop(%9, %6, %4, %3) # test_jit.py:3770:16
[DUMP profiling_graph_executor_impl.cpp:124]     block0(%i.2 : int, %15 : int, %y.7 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:124]       %17 : Double(2) = prim::BailOut[index=1](%y.7, %z.1, %counters.1, %9, %11, %i.2, %15)
[DUMP profiling_graph_executor_impl.cpp:124]       %18 : int[] = aten::append(%counters.1, %15) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %19 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %20 : Tensor = aten::ones(%19, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %21 : Double(1) = prim::BailOut[index=2](%20, %z.1, %counters.1, %9, %11, %i.2, %15, %17)
[DUMP profiling_graph_executor_impl.cpp:124]       %22 : Tensor[] = prim::ListConstruct(%17, %21)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.5 : Double(3) = aten::cat(%22, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %24 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:124]       %25 : int = aten::add(%15, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %26 : int[] = aten::append(%counters.1, %25) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %27 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %28 : Tensor = aten::ones(%27, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %29 : Double(1) = prim::BailOut[index=3](%28, %z.1, %counters.1, %9, %11, %i.2, %y.5, %25)
[DUMP profiling_graph_executor_impl.cpp:124]       %30 : Tensor[] = prim::ListConstruct(%y.5, %29)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.9 : Double(4) = aten::cat(%30, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %32 : int = aten::add(%25, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %33 : int[] = aten::append(%counters.1, %32) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %34 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %35 : Tensor = aten::ones(%34, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %36 : Double(1) = prim::BailOut[index=4](%35, %z.1, %counters.1, %9, %11, %i.2, %y.9, %32)
[DUMP profiling_graph_executor_impl.cpp:124]       %37 : Tensor[] = prim::ListConstruct(%y.9, %36)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.10 : Double(5) = aten::cat(%37, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %39 : int = aten::add(%32, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %40 : int[] = aten::append(%counters.1, %39) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %41 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %42 : Tensor = aten::ones(%41, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %43 : Double(1) = prim::BailOut[index=5](%42, %z.1, %counters.1, %9, %11, %i.2, %y.10, %39)
[DUMP profiling_graph_executor_impl.cpp:124]       %44 : Tensor[] = prim::ListConstruct(%y.10, %43)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.11 : Double(6) = aten::cat(%44, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %46 : int = aten::add(%39, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %47 : int[] = aten::append(%counters.1, %46) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %48 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %49 : Tensor = aten::ones(%48, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %50 : Double(1) = prim::BailOut[index=6](%49, %z.1, %counters.1, %9, %11, %i.2, %y.11, %46)
[DUMP profiling_graph_executor_impl.cpp:124]       %51 : Tensor[] = prim::ListConstruct(%y.11, %50)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.12 : Double(7) = aten::cat(%51, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %53 : int = aten::add(%46, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %54 : int[] = aten::append(%counters.1, %53) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %55 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %56 : Tensor = aten::ones(%55, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %57 : Double(1) = prim::BailOut[index=7](%56, %z.1, %counters.1, %9, %11, %i.2, %y.12, %53)
[DUMP profiling_graph_executor_impl.cpp:124]       %58 : Tensor[] = prim::ListConstruct(%y.12, %57)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.13 : Double(8) = aten::cat(%58, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %60 : int = aten::add(%53, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %61 : int[] = aten::append(%counters.1, %60) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %62 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %63 : Tensor = aten::ones(%62, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %64 : Double(1) = prim::BailOut[index=8](%63, %z.1, %counters.1, %9, %11, %i.2, %y.13, %60)
[DUMP profiling_graph_executor_impl.cpp:124]       %65 : Tensor[] = prim::ListConstruct(%y.13, %64)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.14 : Double(9) = aten::cat(%65, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %67 : int = aten::add(%60, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       %68 : int[] = aten::append(%counters.1, %67) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %69 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %70 : Tensor = aten::ones(%69, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %71 : Double(1) = prim::BailOut[index=9](%70, %z.1, %counters.1, %9, %11, %i.2, %y.14, %67)
[DUMP profiling_graph_executor_impl.cpp:124]       %72 : Tensor[] = prim::ListConstruct(%y.14, %71)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.15 : Tensor = aten::cat(%72, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %74 : int = aten::add(%67, %24)
[DUMP profiling_graph_executor_impl.cpp:124]       -> (%6, %74, %y.15)
[DUMP profiling_graph_executor_impl.cpp:124]   %75 : Double(10) = prim::BailOut[index=10](%y.1, %z.1, %counters.1, %11, %12)
[DUMP profiling_graph_executor_impl.cpp:124]   %76 : int, %y : Tensor = prim::Loop(%11, %6, %12, %75) # test_jit.py:3770:16
[DUMP profiling_graph_executor_impl.cpp:124]     block0(%i.1 : int, %79 : int, %y.6 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:124]       %81 : Double(*) = prim::BailOut[index=11](%y.6, %z.1, %counters.1, %11, %i.1, %79)
[DUMP profiling_graph_executor_impl.cpp:124]       %82 : int[] = aten::append(%counters.1, %79) # test_jit.py:3771:20
[DUMP profiling_graph_executor_impl.cpp:124]       %83 : int[] = prim::ListConstruct(%z.1)
[DUMP profiling_graph_executor_impl.cpp:124]       %84 : Tensor = aten::ones(%83, %5, %5, %5, %5) # test_jit.py:3772:38
[DUMP profiling_graph_executor_impl.cpp:124]       %85 : Double(1) = prim::BailOut[index=12](%84, %counters.1, %11, %i.1, %79, %81)
[DUMP profiling_graph_executor_impl.cpp:124]       %86 : Tensor[] = prim::ListConstruct(%81, %85)
[DUMP profiling_graph_executor_impl.cpp:124]       %y.4 : Tensor = aten::cat(%86, %4) # test_jit.py:3772:24
[DUMP profiling_graph_executor_impl.cpp:124]       %88 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:124]       %89 : int = aten::add(%79, %88)
[DUMP profiling_graph_executor_impl.cpp:124]       -> (%6, %89, %y.4)
[DUMP profiling_graph_executor_impl.cpp:124]   %90 : Double(12) = prim::BailOut[index=13](%y, %counters.1)
[DUMP profiling_graph_executor_impl.cpp:124]   %91 : (Tensor, int[]) = prim::TupleConstruct(%90, %counters.1)
[DUMP profiling_graph_executor_impl.cpp:124]   return (%91)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31400

Differential Revision: D19172750

Pulled By: Krovatkin

fbshipit-source-id: 85d3aac4e80b65b83b6be3c0bca8075a731a2b7e
2019-12-19 00:34:50 -08:00
1e116a5089 Revert D19054937: Add support for del
Test Plan: revert-hammer

Differential Revision:
D19054937

Original commit changeset: c535ea16a9e6

fbshipit-source-id: e57d31811441947b7ee38c8c2b16eecde5005792
2019-12-18 22:39:41 -08:00
489dd6cb90 Add TORCH_DCHECK macro that checks only in debug builds (#31240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31240

Follow up on discoveries/discussions in https://github.com/pytorch/pytorch/pull/30810

Mimic the `DCHECK` macro from https://github.com/pytorch/pytorch/blob/e5eb871/c10/util/logging_is_not_google_glog.h#L117-L125

With this change the perf gap is eliminated:

```
================================================================================
Program Output:
================================================================================
Run on (36 X 1601 MHz CPU s)
2019-12-12 20:12:13
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_IntrusivePtrCtorDtor           23 ns         23 ns   30914703
BM_SharedPtrCtorDtor              27 ns         27 ns   25895944
BM_IntrusivePtrArray/16          503 ns        503 ns    1392139
BM_IntrusivePtrArray/32         1006 ns       1006 ns     695749
BM_IntrusivePtrArray/64         2013 ns       2013 ns     347714
BM_IntrusivePtrArray/128        4024 ns       4024 ns     173964
BM_IntrusivePtrArray/256        8047 ns       8047 ns      86994
BM_IntrusivePtrArray/512       16106 ns      16106 ns      43461
BM_IntrusivePtrArray/1024      32208 ns      32207 ns      21731
BM_IntrusivePtrArray/2048      64431 ns      64430 ns      10865
BM_IntrusivePtrArray/4096     128940 ns     128938 ns       5429
BM_SharedPtrArray/16             503 ns        503 ns    1392128
BM_SharedPtrArray/32            1006 ns       1006 ns     695940
BM_SharedPtrArray/64            2012 ns       2012 ns     347817
BM_SharedPtrArray/128           4024 ns       4023 ns     173927
BM_SharedPtrArray/256           8069 ns       8069 ns      86741
BM_SharedPtrArray/512          16143 ns      16142 ns      43357
BM_SharedPtrArray/1024         32283 ns      32283 ns      21685
BM_SharedPtrArray/2048         64718 ns      64717 ns      10817
BM_SharedPtrArray/4096        129469 ns     129466 ns       5407
================================================================================
```
```
================================================================================
Program Output:
================================================================================
Run on (80 X 2001 MHz CPU s)
2019-12-12 20:12:23
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_IntrusivePtrCtorDtor           18 ns         18 ns   38630411
BM_SharedPtrCtorDtor              22 ns         22 ns   32356114
BM_IntrusivePtrArray/16          402 ns        402 ns    1739637
BM_IntrusivePtrArray/32          805 ns        805 ns     869818
BM_IntrusivePtrArray/64         1610 ns       1609 ns     434881
BM_IntrusivePtrArray/128        3218 ns       3218 ns     217437
BM_IntrusivePtrArray/256        6436 ns       6436 ns     108739
BM_IntrusivePtrArray/512       12882 ns      12882 ns      54356
BM_IntrusivePtrArray/1024      25763 ns      25763 ns      27177
BM_IntrusivePtrArray/2048      51532 ns      51531 ns      13590
BM_IntrusivePtrArray/4096     103091 ns     103091 ns       6778
BM_SharedPtrArray/16             402 ns        402 ns    1740165
BM_SharedPtrArray/32             804 ns        804 ns     869035
BM_SharedPtrArray/64            1610 ns       1610 ns     434975
BM_SharedPtrArray/128           3218 ns       3218 ns     217505
BM_SharedPtrArray/256           6457 ns       6457 ns     108510
BM_SharedPtrArray/512          12909 ns      12909 ns      54249
BM_SharedPtrArray/1024         25810 ns      25810 ns      27127
BM_SharedPtrArray/2048         51763 ns      51763 ns      13531
BM_SharedPtrArray/4096        103506 ns     103505 ns       6759
================================================================================
```

Test Plan:
buck test caffe2/c10/...
buck test mode/opt caffe2/c10/...

Differential Revision: D18998243

fbshipit-source-id: ddf0a118a80efe032b52d403867c1f416c721590
2019-12-18 21:55:58 -08:00
fb24f7c4ad catch all exceptions in converting default values to ivalues (#31398)
Summary:
Previously we would only catch `py::cast_error`, which led to incomprehensible error messages like `TypeError: 'NoneType' object is not iterable`. We are running arbitrary pybind code here and not doing anything with the error message, so we should be less restrictive about the types of errors we catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31398

Differential Revision: D19166655

Pulled By: eellison

fbshipit-source-id: 84db8b3714c718b475913f2f4bb6f19e62f2d9ec
2019-12-18 20:27:46 -08:00
1bb6c51421 Fix getAttribute (#31011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31011

`getAttribute` is supposed to throw when the attribute is not
found rather than return a `nullptr`.

Test Plan:
.

Imported from OSS

Differential Revision: D18898417

fbshipit-source-id: 0fe7d824b978ad19bb5ef094d3aa560e9fc57f87
2019-12-18 19:27:39 -08:00
dff7b945bf Avoid sending large unneeded data over wire in process_group_agent. (#31357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31357

If a user selects a subset of a Tensor and sends it in an RPC, we were sending
the whole original Tensor Storage over the network.

While this sounds reasonable, in practice, we observed view-like Tensors being sent
over rpc, where only 1% of the data in the provided Tensor's Storage was
actually used/needed.

The simple solution here is to just force a clone in the serializer code if we see that
less than (arbitrary) half the bits are used, and the tensor is more than a nominal few KB.
Add related tests to ensure this doesn't break.

An alternate approach would be to modify the Pickler. That said, since Pickler is shared by more
components, the logic might be harder to tailor appropriately at that layer (particularly
given that the Pickler has explicit logic to share a single Storage* among several Tensors
that commonly point to the same Storage*).

It's possible that we might want to further refine the basic thresholds in this change.
In practice, we've seen a mostly bimodal distribution thus far for the percent of Tensor
Storage referred by a Tensor in observed rpcs (i.e. either 90%+ or sub-10% of the Storage
referenced), hence the existing 50% threshold here is probably not an unreasonable
starting point.
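In Python terms, the heuristic is roughly the following (a sketch with illustrative names and thresholds; the real logic lives in the C++ serializer):

```
import torch

def maybe_clone_for_wire(t, min_bytes=4096, frac=0.5):
    # Clone when the tensor references less than `frac` of its storage and
    # the storage is larger than a nominal few KB, so unused bits aren't sent.
    storage_bytes = t.storage().size() * t.element_size()
    used_bytes = t.numel() * t.element_size()
    if storage_bytes > min_bytes and used_bytes < frac * storage_bytes:
        return t.clone()
    return t

big = torch.randn(1000, 1000)
view = big[:10, :10]  # references ~0.01% of big's storage
print(maybe_clone_for_wire(view).storage().size())  # 100, not 1000000
```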
ghstack-source-id: 95925474

Test Plan: buck test mode/dev caffe2/test/cpp/rpc/...

Differential Revision: D19137056

fbshipit-source-id: e2b3a4dd0cc6e1de820fd0740aa1d59883dbf8d4
2019-12-18 19:24:24 -08:00
1bb800cf5c Updating submodules
Summary:
GitHub commits:

f5d37bdcfd
21ba9e3692
576eeaee27
7ba1f57d53
e520f8f5b3
54f9092b0c
88bb770ce1
d91888de6c
ff06eb0881
fdaeb6ea30
1fd432f00f
60b7cb3408

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: f63bd0a879f4d08e159f530f595067f5a09ffe70
2019-12-18 18:41:23 -08:00
fe707c7849 Use default_observer and default_weight_observer in tests (#31424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31424

att

Test Plan:
test_jit.py

Imported from OSS

Differential Revision: D19162368

fbshipit-source-id: 33b95ba643eeeae942283bbc33f7ceda8d14c431
2019-12-18 18:35:07 -08:00
e1509cb468 Add support for del (#31273)
Summary:
Adds the `del` keyword to the parser and corresponding `aten::Delete` op for lists and dicts

Fixes #20615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31273

Pulled By: driazati

Differential Revision: D19054937

fbshipit-source-id: c535ea16a9e62d176f8ad45947670fc3535af77c
2019-12-18 18:19:22 -08:00
e7d25a3e4d add a suggested alternative to _get_trace_graph
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31441

Test Plan: Imported from OSS

Differential Revision: D19165646

Pulled By: suo

fbshipit-source-id: 96a264bc55ceafd798d92b986d319cddbb0d9c69
2019-12-18 17:34:25 -08:00
d2e66b44cc Temporary fix to support building pytorch from fbsource (for xplat dependencies) (#31393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31393

The pytorch build was set up with the include paths (-I) relative to fbcode/. This works well for fbcode builds, but doesn't work for the new fbcode_deps args for xplat build targets that work across xplat and fbcode. When these targets are built, the include paths need to be relative to fbsource, so the fbcode/ prefix needs to be added to those paths.

Longer term, to properly fix this, we need to use raw_headers with public_include_directories specified for all of these targets.

Test Plan: buck test mode/dev //papaya/integration/service/local/test:mnist_federated_system_test -- 'MnistFederatedSystemTest\.test' --run-disabled

Reviewed By: mzlee

Differential Revision: D19148465

fbshipit-source-id: a610e84bf4cad5838e54e94bae71b957c4b6d4b5
2019-12-18 17:30:57 -08:00
a3cdb7eca3 Fix default instantation of dynamic quantized LSTM
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31433

Test Plan: Imported from OSS

Differential Revision: D19164539

Pulled By: jamesr66a

fbshipit-source-id: 7045817ab3dfb530c4480a10523c4c6bcdbfc7eb
2019-12-18 16:59:00 -08:00
1e80ff7a67 autograd/profiler: make record_function more threadsafe (#31346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31346

This makes it so that if profiling is enabled/disabled from a different thread while a RecordFunction span is active via an op, it doesn't crash the process.

We currently see this when using torch.distributed.rpc to enable/disable profiling on other nodes while other things are running.

Test Plan: buck test //caffe2/test:autograd -- test_record_function

Reviewed By: albanD

Differential Revision: D19133258

fbshipit-source-id: 30712b06c6aa051789948de2918dcfb9b78967ba
2019-12-18 16:27:42 -08:00
148bcd3ee5 Add support for builtins as attributes (#31269)
Summary:
Fixes #27495

This adds builtins as another piece of a concrete type. They're separate from normal functions since they represent the `BuiltinFunction` sugared value (which is a direct call to a builtin op). It also moves the builtins related logic from `jit/__init__.py` to `jit/_builtins.py` so it can be used from `jit/_recursive.py` to look up functions in the builtins table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31269

Pulled By: driazati

Differential Revision: D19149779

fbshipit-source-id: d4e5e5d7d7d528b75a2f503e6004394251a4e82d
2019-12-18 15:24:45 -08:00
503a4e9019 Cleanup after moving language reference (#31146)
Summary:
Stacked PRs
 * **#31146 - [jit] Cleanup after moving language reference**
 * #31138 - [jit] Move TorchScript language reference to its own page

Preview: https://driazati.github.io/pytorch_doc_previews/jit.html#torchscript-language

Pull Request resolved: https://github.com/pytorch/pytorch/pull/31146

Pulled By: driazati

Differential Revision: D19167390

fbshipit-source-id: f28daed36754a553264fc8ac142ed22c3e26d63e
2019-12-18 15:09:35 -08:00
ae2487bf4d Move TorchScript language reference to its own page (#31138)
Summary:
Stacked PRs
 * #31146 - [jit] Cleanup after moving language reference
 * **#31138 - [jit] Move TorchScript language reference to its own page**

Preview: https://driazati.github.io/pytorch_doc_previews/jit.html#torchscript-language

Pull Request resolved: https://github.com/pytorch/pytorch/pull/31138

Pulled By: driazati

Differential Revision: D19167375

fbshipit-source-id: d37110d85fc8b8d2c741be49846e873de1357c2a
2019-12-18 15:09:31 -08:00
d08250c223 fix zero-batch handling in convtranspose (#24341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24341

ConvTransposeOp doesn't crash for zero-batch input, but it doesn't modify the output blob either. This leads to buggy behaviour, especially when running the same network twice with different inputs, or during backprop in training.

Seems `ConvTransposeUnpoolBase<Context>::GetOutputSize` works for zero-batch, so I removed the check for `input.numel() > 0` and reshape the output blob before returning.

For CudnnConvTransposeGradientOp, it's a bit verbose to set `dfilter` and `dbias`, and it seems cuDNN can handle it, so simply remove the `X.numel() == 0` branch.

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:conv_transpose_test -- --run-disabled

Reviewed By: BIT-silence

Differential Revision: D16807606

fbshipit-source-id: 0d72c5bd8f2e03c34465e7b530cca548d9bdd5e1
2019-12-18 15:06:36 -08:00
7692494c67 Fix hex literal parsing (#29935)
Summary:
Stacked PRs
 * #29940 - [jit] Fix parsing of big float literals
 * **#29935 - [jit] Fix hex literal parsing**
 * #29931 - [jit] Throw a better error for int too big for int64_t

Previously these were all parsed as `0`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29935

Pulled By: driazati

Differential Revision: D19124944

fbshipit-source-id: 1ee0c1dee589933363a5efba069a2cfaf94373c5
2019-12-18 14:00:22 -08:00
1f50cfc24d Throw a better error for int too big for int64_t
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29931

Pulled By: driazati

Differential Revision: D19124934

fbshipit-source-id: 91841d7ba4f2f6142c51fba07b7faa14bb817e3a
2019-12-18 14:00:16 -08:00
fb30a48b4e add unsupported section (#31329)
Summary:
Add a section for unsupported ops and modules. Automatically generate the properties and attributes that aren't bound, and for ops that have semantic mismatches, set up tests so the docs stay up to date.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31329

Differential Revision: D19164472

Pulled By: eellison

fbshipit-source-id: 46290bb8a64d9de928cfb1eda5ff4558c3799c88
2019-12-18 13:56:02 -08:00
5e8bac24b4 Migrate soft_margin_loss from the TH to Aten (CUDA+CPU) (#28135)
Summary:
Fix: https://github.com/pytorch/pytorch/issues/24631, https://github.com/pytorch/pytorch/issues/24632, https://github.com/pytorch/pytorch/issues/24764, https://github.com/pytorch/pytorch/issues/24765

Port of TH SoftMarginCriterion to ATen using un-fused tensor operators but with custom backward code. This is a follow-up/fix of the reverted PR https://github.com/pytorch/pytorch/issues/27673.
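For reference, the ported op's Python-facing usage (targets in {-1, 1}; illustrative values):

```
import torch
import torch.nn.functional as F

x = torch.randn(8, requires_grad=True)
y = torch.where(torch.randn(8) > 0, torch.ones(8), -torch.ones(8))
loss = F.soft_margin_loss(x, y)  # mean of log(1 + exp(-y * x))
loss.backward()
print(loss.item())
```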

Benchmark results:

CPU became faster, GPU slower. To reach the previous TH perf, manual fusion is probably necessary.

### WITH patch
```
CPU warmup 1000 took 7.997200009413064e-05
CPU warmup 10000 took 0.0008116499957395718
CPU warmup 100000 took 0.0012691459996858612
CPU warmup TOTAL time 0.0021982479956932366
CPU forward 1000 took 7.320100849028677e-05
CPU forward 10000 took 0.00015837099635973573
CPU forward 100000 took 0.0010471990099176764
CPU forward 1000000 took 0.01238470000680536
CPU forward 10000000 took 0.12747182900784537
CPU forward 100000000 took 1.2076255190040683
CPU forward TOTAL time 1.3488940890092636
CPU for- & backward 1000 took 0.00032587299938313663
CPU for- & backward 10000 took 0.0006926299975020811
CPU for- & backward 100000 took 0.002146183993318118
CPU for- & backward 1000000 took 0.019158899012836628
CPU for- & backward 10000000 took 0.2957490350090666
CPU for- & backward 100000000 took 1.7630806300003314
CPU for- & backward TOTAL time 2.081367089995183

GPU warmup 1000 took 0.0004558280052151531
GPU warmup 10000 took 0.0002567449992056936
GPU warmup 100000 took 0.0001593509950907901
GPU warmup TOTAL time 0.0009442300070077181
GPU forward 1000 took 0.00015061900194268674
GPU forward 10000 took 0.00015258099301718175
GPU forward 100000 took 0.00015409699699375778
GPU forward 1000000 took 0.0008183339959941804
GPU forward 10000000 took 0.004424853003001772
GPU forward 100000000 took 0.04356115800328553
GPU forward TOTAL time 0.04938192600093316
GPU for- & backward 1000 took 0.0008062430133577436
GPU for- & backward 10000 took 0.0006074949924368411
GPU for- & backward 100000 took 0.0007091690058587119
GPU for- & backward 1000000 took 0.001022183001623489
GPU for- & backward 10000000 took 0.009945805999450386
GPU for- & backward 100000000 took 0.0944173600000795
GPU for- & backward TOTAL time 0.28060428200114984
```

### WITHOUT patch
```
CPU warmup 1000 took 6.394000956788659e-05
CPU warmup 10000 took 0.00038220599526539445
CPU warmup 100000 took 0.0034939230099553242
CPU warmup TOTAL time 0.003981974994530901
CPU forward 1000 took 4.7855006414465606e-05
CPU forward 10000 took 0.000347569992300123
CPU forward 100000 took 0.003367935001733713
CPU forward 1000000 took 0.03605044000141788
CPU forward 10000000 took 0.35935167300340254
CPU forward 100000000 took 3.630371332008508
CPU forward TOTAL time 4.029640004009707
CPU for- & backward 1000 took 0.00028494100843090564
CPU for- & backward 10000 took 0.0006738200027029961
CPU for- & backward 100000 took 0.0051178760040784255
CPU for- & backward 1000000 took 0.04925115800870117
CPU for- & backward 10000000 took 0.7172313440096332
CPU for- & backward 100000000 took 5.441953932997421
CPU for- & backward TOTAL time 6.21466830400459

GPU warmup 1000 took 0.001803738996386528
GPU warmup 10000 took 0.00041877900366671383
GPU warmup 100000 took 0.0003870719956466928
GPU warmup TOTAL time 0.0026561370032140985
GPU forward 1000 took 0.00037833399255760014
GPU forward 10000 took 0.00038825398951303214
GPU forward 100000 took 0.0003841099969577044
GPU forward 1000000 took 0.0007090550061548129
GPU forward 10000000 took 0.0016171559982467443
GPU forward 100000000 took 0.013463679002597928
GPU forward TOTAL time 0.017010531009873375
GPU for- & backward 1000 took 0.0007374050037469715
GPU for- & backward 10000 took 0.0006343529967125505
GPU for- & backward 100000 took 0.0006375070079229772
GPU for- & backward 1000000 took 0.0007550300069851801
GPU for- & backward 10000000 took 0.002672752001672052
GPU for- & backward 100000000 took 0.023170708998804912
GPU for- & backward TOTAL time 0.20251446698966902
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28135

Differential Revision: D18001447

Pulled By: VitalyFedyunin

fbshipit-source-id: ad90dc1cca42dcaf3ea9e17e4f8fd79cee0a293e
2019-12-18 13:33:59 -08:00
7cf8b9bada Move leaky_relu to Aten(CPU, CUDA) (#29899)
Summary:
VitalyFedyunin, this PR ports the LeakyReLU activation to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.LeakyReLU()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backwad avg time is 0.17 (ms).

CPU:
OMP_NUM_THREADS=56
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.14 (ms).
input size(128, 10000) forward time is 4.21 (ms); backwad avg time is 8.02 (ms).
OMP_NUM_THREADS=1
input size(128, 100) forward time is 0.02 (ms); backwad avg time is 0.07 (ms).
input size(128, 10000) forward time is 1.98 (ms); backwad avg time is 6.21 (ms)
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backwad avg time is 0.17 (ms).

CPU:
OMP_NUM_THREADS=56
input size(128, 100) forward time is 0.02 (ms); backwad avg time is 0.04 (ms).
input size(128, 10000) forward time is 0.03 (ms); backwad avg time is 0.09 (ms).
OMP_NUM_THREADS=1
input size(128, 100) forward time is 0.01 (ms); backwad avg time is 0.02 (ms).
input size(128, 10000) forward time is 0.47 (ms); backwad avg time is 1.02 (ms).
```
How to set the number of threads? Use the following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run **./run.sh num_threads test.py**.

Fixes https://github.com/pytorch/pytorch/issues/24583 #24584 https://github.com/pytorch/pytorch/issues/24720 #24721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29899

Differential Revision: D18816231

Pulled By: VitalyFedyunin

fbshipit-source-id: afb1e43a99317d17f50cff1b593cd8f7a0a83da2
2019-12-18 13:14:11 -08:00
b0bd35ff13 caffe2/event: allow multiple errors such as when cancelled (#31335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31335

When an error occurs in a net we end up cancelling all the async ops. If one error occurs it's highly likely other errors will occur as well.

Typically we see:
1. SendOp failed due to a network error
2. async scheduling cancels all other ops via `SetFinished("Cancelled");`
3. Another SendOp fails due to a network error and crashes the process when the exception is thrown.

This changes caffe2 ops to allow failing twice.

Test Plan: buck test //caffe2/caffe2:caffe2_test_cpu

Reviewed By: andrewwdye

Differential Revision: D19106548

fbshipit-source-id: 4b7882258a240894cc16d061a563c83a3214d3d9
2019-12-18 13:10:57 -08:00
4d22c3ba01 fix docker login, add docker image tag list after purge as html (#31328)
Summary:
example of the generated html: http://ossci-docker.s3-website.us-east-1.amazonaws.com/pytorch.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31328

Differential Revision: D19147113

Pulled By: mingbowan

fbshipit-source-id: 5104e92d4490f047a6474e2b12aed3293b52a9df
2019-12-18 12:08:51 -08:00
47766e648f C++ API parity: MultiheadAttention
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27309

Test Plan: Imported from OSS

Differential Revision: D17766736

Pulled By: pbelevich

fbshipit-source-id: 7a5f2399f081945d31d4c13d7a8d248c387fc1a6
2019-12-18 10:13:29 -08:00
c63f8e5ebe Fix typo in data.rst docs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31395

Differential Revision: D19160010

Pulled By: zou3519

fbshipit-source-id: cbc4e719e69117e8747617729d240c72e7a4e3dd
2019-12-18 09:52:10 -08:00
285cc13435 check devices for all input tensors in index_put (#31280)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/30960
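A minimal sketch of the kind of mismatch now checked (assuming a CUDA device; the exact failing combination is in the linked issue):

```
import torch

if torch.cuda.is_available():
    x = torch.zeros(4, device='cuda')
    idx = torch.tensor([0, 1], device='cuda')
    vals = torch.ones(2)  # CPU values with CUDA self/indices
    try:
        x.index_put_((idx,), vals)
    except RuntimeError as e:
        print(e)  # device mismatch is now reported up front
```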
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31280

Differential Revision: D19149114

Pulled By: ngimel

fbshipit-source-id: af185a98ac6ea614f43bbf865de02ea113d4ed56
2019-12-18 09:25:40 -08:00
913323750d CODEOWNERS for distributed optimizer. (#31403)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31403

ghstack-source-id: 95874532

Test Plan: waitforbuildbot

Differential Revision: D19154217

fbshipit-source-id: a18ebe646b97c83cc0eb0821b10b4c76d5ce2878
2019-12-18 09:25:35 -08:00
359c39b3c2 Use global lock instead of per instance lock. (#31404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31404

Multiple "trainers" could each create different instances of DistributedOptimizer, which means we can still have a race condition unless we do a trully global per worker lock.
ghstack-source-id: 95874624

Test Plan: run unit tests -- unfortunately, due to the non-deterministic behavior it's not clear how to unit test this properly.

Differential Revision: D19154248

fbshipit-source-id: fab6286c17212f534f1bd1cbdf9f0de002d48c74
2019-12-18 09:22:54 -08:00
386cd59d44 Remove redundant queries of qconfig in insertObservers (#31292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31292

att
Also we need to do this check after we call `insertObservers` on invoked modules
as well since qconfig can be None for parent module while being valid for invoked modules

Test Plan:
.

Imported from OSS

Differential Revision: D19146668

fbshipit-source-id: be6811353d359ed3edd5415ced29a4999d86650b
2019-12-18 09:15:52 -08:00
58d2dd5b73 Enabled flip for bool tensors (#31267)
Summary:
Fix this [issue](https://github.com/pytorch/pytorch/issues/31213)
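A quick illustration of the enabled case:

```
import torch

m = torch.tensor([True, False, False])
print(torch.flip(m, [0]))  # tensor([False, False,  True])
```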
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31267

Differential Revision: D19047249

Pulled By: izdeby

fbshipit-source-id: f58ca3ac88aab28742b8d345400270f7d31c3856
2019-12-18 09:01:32 -08:00
3e59e80429 Revert D18941024: Move TorchScript language reference to its own page
Test Plan: revert-hammer

Differential Revision:
D18941024

Original commit changeset: d0ff600870a1

fbshipit-source-id: 01c0eac4c9741f27b91d710616e71a0d769f6f6a
2019-12-18 08:55:50 -08:00
3694749cd1 Detect dill version in torch.save/load (#30985)
Summary:
Fix for issue https://github.com/pytorch/pytorch/issues/28313
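For context, the path this touches is the `pickle_module` argument (a sketch; requires a sufficiently recent dill):

```
import dill
import torch

# torch.save/torch.load accept a custom pickle module; the fix validates
# the dill version when dill is passed here.
torch.save({'a': torch.ones(2)}, 'ckpt.pt', pickle_module=dill)
obj = torch.load('ckpt.pt', pickle_module=dill)
print(obj['a'])
```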
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30985

Differential Revision: D19142947

Pulled By: zou3519

fbshipit-source-id: 10e3a182a99e80ca8c9c8328b6f8764b27d78eb3
2019-12-18 08:05:08 -08:00
74e59c6fed caffe2::TypeInfo fix when using clang-cl on Windows (#31364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31364

clang-cl defines both `_MSC_VER` and `__clang__`. Names are mangled clang-style, though. Calling `extract` with the wrong name-mangling pattern will throw `std::logic_error`. This crashes on Windows when `get_fully_qualified_type_name` is called because it is marked `noexcept`.

Test Plan: Windows builds no longer crash on startup.

Reviewed By: mattjgalloway

Differential Revision: D19142064

fbshipit-source-id: 516b9b63daeff30f5c097d192b0971c7a42db57e
2019-12-18 07:51:07 -08:00
c05538b831 Move TorchScript language reference to its own page (#31138)
Summary:
Preview: https://driazati.github.io/pytorch_doc_previews/jit.html#torchscript-language
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31138

Pulled By: driazati

Differential Revision: D18941024

fbshipit-source-id: d0ff600870a14c4a7c6ce54867d152072a12c48c
2019-12-18 00:46:19 -08:00
3c8892aa0c avoid doing quadratic work in concrete type inference (#31020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31020

Before, the recursive scripting process re-did the concrete type
inference process for every submodule call. This changes things so that
the concrete type inference process only occurs once (at the top level),
and we re-use all the inferred concrete types while recursively
compiling submodules.

This is both more efficient (we don't do n^2 work inferring concrete
types) and less bug-prone (since we infer the concrete type only once,
there is no possibility of a mismatch).

Test Plan: Imported from OSS

Differential Revision: D18904110

Pulled By: suo

fbshipit-source-id: 6560b85ae29fe5e9db1ee982dbf8bc222614b8d8
2019-12-17 21:55:55 -08:00
878b0e35f7 Simplify recursive script compilation flow. (#31019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31019

No more `recursive_script`, just direct calls to `create_script_module`.
This reduces the number of pathways through the frontend, and the
uniformity is useful for a future PR.

Test Plan: Imported from OSS

Differential Revision: D18904113

Pulled By: suo

fbshipit-source-id: 7de061dfef0cbdfc9376408fc6c1167b81803f01
2019-12-17 21:55:50 -08:00
82d52bc718 remove remnants of properties hack (#31018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31018

Properties are now disallowed so this hack is no longer necessary

Test Plan: Imported from OSS

Differential Revision: D18904112

Pulled By: suo

fbshipit-source-id: 83448da677082d59355729bb72d9f9f4c31ea756
2019-12-17 21:55:45 -08:00
7e81d72d12 remove unnecessary arg from create_script_module (#31017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31017

This arg is now derivable from another one, so we don't need to pass both.

Test Plan: Imported from OSS

Differential Revision: D18904111

Pulled By: suo

fbshipit-source-id: ea74ea9c2ae83d9e0e6977b0eb6629f53545e2e4
2019-12-17 21:55:41 -08:00
e5631119f6 use expect instead of casting in register_c10_ops (#31401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31401

As title, just a mechanical change

Test Plan: Imported from OSS

Differential Revision: D19152965

Pulled By: suo

fbshipit-source-id: 6bb27df7c8f542c55110286c156358ba0936269f
2019-12-17 21:37:59 -08:00
4ec2448580 Update OVERVIEW.md (#31373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31373

Just some housekeeping

Test Plan: Imported from OSS

Differential Revision: D19145987

Pulled By: suo

fbshipit-source-id: ae8142dab2bddcf0b628c27c426ca26334c48238
2019-12-17 21:29:16 -08:00
e0ab255a51 Updates to serialization.md (#31372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31372

Keeping it current with the latest changes.

Test Plan: Imported from OSS

Differential Revision: D19145986

Pulled By: suo

fbshipit-source-id: 88122e66fa87a354ef8e87faffe58551074e3f03
2019-12-17 21:29:12 -08:00
e169e02836 Refactor custom op tests (#31282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31282

Introduce a helper to easily call stack ops
ghstack-source-id: 95855728

Test Plan: unit tests

Differential Revision: D19061515

fbshipit-source-id: a7d6329e26cd3d94730d88c8a6393e10bfbd8e9b
2019-12-17 20:48:01 -08:00
c5d2758c35 Disable flaky TestMomentumSGD.test_fp16momentum_sgd (#31369)
Summary:
Related to https://github.com/pytorch/pytorch/issues/31368
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31369

Differential Revision: D19147072

Pulled By: VitalyFedyunin

fbshipit-source-id: 6fad13be7b35f992d84a20f23877cad05ff18616
2019-12-17 19:16:54 -08:00
e3fecabdcb Setup operator registration for distributed package (#31214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31214

This set up the basic infrastructure for distributed autograd and rpc to
bind their operators to TorchScript, since the whole distributed package
is builtin behind the `USE_DISTRIBUTED` flag, we separate the
registration and build it only when the flag is on.

Test Plan: Imported from OSS

Differential Revision: D19137160

fbshipit-source-id: ff47dc4c380ebe273fe0eea9e5e3fccfbd6466d7
2019-12-17 17:26:43 -08:00
e33dea6e4e dynamically quantized lstm benchmarking
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30149

Test Plan: Imported from OSS

Differential Revision: D18613005

Pulled By: z-a-f

fbshipit-source-id: 966bfe2c862b1b4006b228bd9115c5c1cd3ad8cf
2019-12-17 16:52:04 -08:00
f0243ea712 Use [[deprecated]] instead of C10_DEPRECATED (#30918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30918

This is a C++14 feature we can use now
ghstack-source-id: 95811482

Test Plan: waitforsandcastle

Differential Revision: D18869636

fbshipit-source-id: b5b3d78b61b6ceb2deda509131f8502e95b1d057
2019-12-17 15:21:34 -08:00
d9c3913dfc move BatchPermutationOp to caffe2/operators
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31350

Reviewed By: houseroad

Differential Revision: D19053527

fbshipit-source-id: 50d11f137d0f5c07e8ad899a3a84d56a042bbc32
2019-12-17 14:58:27 -08:00
0b8332efb4 Remove c++11 examples from doc comments (#30925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30925

-
ghstack-source-id: 95810835

Test Plan: it's just comments

Differential Revision: D18869634

fbshipit-source-id: 346498ae2472dbfe23ef40533bff891fde9922c4
2019-12-17 14:58:22 -08:00
5554e5b793 Docs: c++11 -> c++14 (#30530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30530

Switch some mentions of "C++11" in the docs to "C++14"
ghstack-source-id: 95812049

Test Plan: testinprod

Differential Revision: D18733733

fbshipit-source-id: b9d0490eb3f72bad974d134bbe9eb563f6bc8775
2019-12-17 14:09:02 -08:00
cc8d6342fc make profiling take no_grad flags into account (#31071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31071

Previously the profiler would think Tensors would require grad, even
when the no_grad flag is enabled during execution. This makes the profiling
and guards respect the no_grad flag, which eliminates extra differentiable
graphs that appear in the backward graph (where no_grad is typically enabled).

Test Plan: Imported from OSS

Differential Revision: D18915468

Pulled By: zdevito

fbshipit-source-id: 1ae816a16ab78ae5352825cc6b4a68ed7681a089
2019-12-17 13:22:16 -08:00
dab5f72543 we should have a config-based way to skip flaky tests (#30978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30978

This particular approach queries our issue tracker for test titles that
match the following format:

```
DISABLED test_async_grad_guard_with_grad (jit.test_async.TestAsync)
```

And then skips the matching Python tests. There is a 1-second timeout, so
if the internet flakes we still run the test suite without disabling any
tests.

This is intended as a quick fix, similar to ninja unland, to get to a green
master. Long-term, test disables should go into the code.
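A minimal sketch of the mechanism; the issue-tracker URL, JSON shape, and helper names are assumptions for illustration:

```
import json
import unittest
import urllib.request

DISABLED_TESTS_URL = "https://api.github.com/search/issues?q=DISABLED+in:title"  # assumed

def fetch_disabled_test_titles(url=DISABLED_TESTS_URL):
    # 1-second timeout: if the internet flakes, fall back to running everything.
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return {item["title"] for item in json.load(resp)["items"]}
    except Exception:
        return set()

DISABLED_TITLES = fetch_disabled_test_titles()

def raise_if_disabled(test_name, test_class):
    # Matches titles of the form: DISABLED test_foo (module.TestClass)
    if "DISABLED {} ({})".format(test_name, test_class) in DISABLED_TITLES:
        raise unittest.SkipTest("test disabled via issue tracker")
```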

Test Plan: Imported from OSS

Pulled By: zdevito

Differential Revision: D18890532

fbshipit-source-id: fe9447e59a6d5c9ad345f7c3ff15d63b6d2a09e2
2019-12-17 11:58:43 -08:00
d2067569e7 Kill THTensor_(bhistc). (#31254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31254

It's not used.

Test Plan: Imported from OSS

Differential Revision: D19022923

Pulled By: gchanan

fbshipit-source-id: caa5e6b7a133f24f8f3349fd1e53147f8dd3fd97
2019-12-17 08:54:17 -08:00
49eff2f43c Kill THSize. (#31218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31218

It isn't used.

Test Plan: Imported from OSS

Differential Revision: D18986641

Pulled By: gchanan

fbshipit-source-id: 0a434941d12193941f097232c18ffe4268bf5f82
2019-12-17 08:54:13 -08:00
52b8a52e4d move AliasWithNameOp to caffe2/operators
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31281

Reviewed By: houseroad

Differential Revision: D19053453

fbshipit-source-id: 350bfd5c001db9c17916dcae7ade8f56db1e9841
2019-12-17 02:39:40 -08:00
0e548a76eb Upgrade exported ONNX IR version to 6 (#31025)
Summary:
Upgrade the IR version from 4 to 6; below is the change log from ONNX. The upgrade should be backward compatible.

```
  // IR VERSION 5 published on March 18, 2019
  // - Add message TensorAnnotation.
  // - Add quantization annotation in GraphProto to map tensor with its scale and zero point quantization parameters.
  IR_VERSION_2019_3_18 = 0x0000000000000005;

  // IR VERSION 6 published on Sep 19, 2019
  // - Add support for sparse tensor constants stored in model.
  //   - Add message SparseTensorProto
  //   - Add sparse initializers
  IR_VERSION = 0x0000000000000006;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31025

Reviewed By: hl475

Differential Revision: D18935444

Pulled By: houseroad

fbshipit-source-id: 9ba47f9657fa1a668db291cf04af07d5e8d73c21
2019-12-16 23:18:22 -08:00
10ce1765be Introducing ScalarTypeType and LayoutType (#31074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31074

As the title,
It's step 1 in https://github.com/pytorch/pytorch/pull/30694#issuecomment-564205276.

Not using those types in any other place.

Test Plan: Making sure all unit tests and build pass successfully.

Differential Revision: D18916246

fbshipit-source-id: c8213307ed196e1b51ce1a2a7c10869dcd45b79e
2019-12-16 21:46:47 -08:00
f9010d7648 remove wipe cache from op bench (#31334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31334

The wipe-cache logic was introduced in the hope of reducing variation in benchmark results. Based on our experiment results, it didn't actually help with that. In addition, several engineers had run into a missing cpuinfo.h, which was used in the wipe-cache logic. So this diff removes the feature to ensure smooth installation and running of the op bench.

Test Plan:
```
buck run caffe2/benchmarks/operator_benchmark/pt:add_test -- --iterations 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M1_N1_K1_cpu
# Input: M: 1, N: 1, K: 1, device: cpu
Forward Execution Time (us) : 111.192
```

A/B test also passes: Benchmark Run #2476535015

Reviewed By: hl475

Differential Revision: D19126970

fbshipit-source-id: 9b1ab48c121838836ba6e0ae664a48fe2d18efdd
2019-12-16 16:34:14 -08:00
229ce89b92 Fix coverage and hypothesis conflict (#31320)
Summary:
Temporarily enforcing versions for all envs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31320

Differential Revision: D19122781

Pulled By: VitalyFedyunin

fbshipit-source-id: fe6473b177367371387d4b3b873131e7ecfbc0f8
2019-12-16 15:52:42 -08:00
c5d3be1102 Remove the second copy on calling dist_autograd_context._known_worker_ids() (#31206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31206

Improvement on #25525.

- DistAutogradContext::getKnownWorkerIds() returns an unordered_map as a temporary value. No need to copy this temp value A into another temp value B.
ghstack-source-id: 95736296

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork --  test_worker_ids_recorded
```

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift -- test_context_cleanup_tensor_with_grad
```

Differential Revision: D5707771

fbshipit-source-id: 9fea83dc69b02047aef8b02a73028a260ac0be40
2019-12-16 15:07:39 -08:00
643ca5def2 Replace c10::guts::stuff with std::stuff (#30915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30915

Since we now have C++14, we don't need these c10::guts helpers anymore
ghstack-source-id: 95777609

Test Plan: waitforsandcastle

Differential Revision: D18869639

fbshipit-source-id: 97716f932297c64c6e814410ac47b444c33d4e2e
2019-12-16 13:57:19 -08:00
c6a8f884d8 add copy_ operator the op bench (#31327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31327

Adds copy_ operator to the benchmark suite

Test Plan:
```
buck run caffe2/benchmarks/operator_benchmark/pt:binary_test -- --iterations 1 --operators copy_
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: copy_
# Mode: Eager
# Name: copy__M1_N1_K1_cpu_dtype_onetorch.int32_dtype_twotorch.int32
# Input: M: 1, N: 1, K: 1, device: cpu, dtype_one: torch.int32, dtype_two: torch.int32
Forward Execution Time (us) : 60.645
```

Reviewed By: hl475

Differential Revision: D19122910

fbshipit-source-id: e5f0b0e2612daae0201b1b4a87f52b971e0cc4a8
2019-12-16 13:45:12 -08:00
d401ba1417 benchmark binary ops in binary_test (#31326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31326

as title

Test Plan:
```
buck run caffe2/benchmarks/operator_benchmark/pt:binary_test -- --iterations 1

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_in_one[64,1,64]_in_two[1,64,1]_cpu_dtypetorch.float32
# Input: in_one: [64, 1, 64], in_two: [1, 64, 1], device: cpu, dtype: torch.float32
Forward Execution Time (us) : 28080.802
```

Reviewed By: hl475

Differential Revision: D19120113

fbshipit-source-id: 1105de208f7609cc6d74f0b5bc6fe75f19146b28
2019-12-16 13:45:08 -08:00
455e85a2f1 Fix unflatten when dim is a negative integer (#31208)
Summary:
Changelog:
- Wrap dim to be a positive integer when dim is negative
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31208

Test Plan:
- Updated tests in test_namedtensor.py

Fixes https://github.com/pytorch/pytorch/issues/31184
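A hedged illustration of the fixed behavior, written against the current `unflatten` signature:

```
import torch

x = torch.randn(2, 12)
y = x.unflatten(-1, (3, 4))  # -1 wraps to -1 + x.dim() == 1
assert y.shape == (2, 3, 4)

# The wrapping itself amounts to:
def wrap_dim(dim, ndim):
    return dim + ndim if dim < 0 else dim
```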

Differential Revision: D19036569

Pulled By: zou3519

fbshipit-source-id: 86e01e20988dee7c4b6c73232f66282d687f9a2c
2019-12-16 12:48:03 -08:00
9ca61aec0f Kill THLogAdd (#31217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31217

It doesn't seem to be used.

Test Plan: Imported from OSS

Differential Revision: D18986642

Pulled By: gchanan

fbshipit-source-id: 96d615df82731d2224d403ab6e2cad6d4c6674fd
2019-12-16 12:30:16 -08:00
409151e1bb Use [[noreturn]] instead of C10_NORETURN or CAFFE_NORETURN (#30917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30917

This is a C++14 feature, we can use this now.
ghstack-source-id: 95255753

Test Plan: waitforsandcastle

Differential Revision: D18869637

fbshipit-source-id: dd02036b9faeaffa64b2d2d305725443054da31b
2019-12-15 23:54:16 -08:00
c95d46abbd Remove C++11 compatibility from c10::util::crc64_t (#30920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30920

deletecode
ghstack-source-id: 95255641

Test Plan: waitforsandcastle

Differential Revision: D18869640

fbshipit-source-id: c3d7f4e1a29caff9fd8a8141c258f6f1c3fd830c
2019-12-15 23:43:02 -08:00
0d7391f8b2 Test cases for custom ops with autograd (#31003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31003

-
ghstack-source-id: 95663728

Test Plan: unit tests

Differential Revision: D18896189

fbshipit-source-id: d71f7678fff644536fe30452ee21a5a7df1f1f0b
2019-12-15 22:37:24 -08:00
930d0751e6 Java Tensor hybrid, owns at::Tensor, no memcopy for java outputs. (#30501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30501

**Motivation**:
Currently, the output of a libtorch Module forward/runMethod is memory-copied to a Java ByteBuffer, which is allocated, at least in some versions of Android, on the Java heap. That could lead to intensive garbage collection.

**Change**:
The output Java tensor becomes the owner of the output at::Tensor and keeps it alive (as the `pytorch_jni::TensorHybrid::tensor_` field) until the Java part is destroyed by GC. For that, org.pytorch.Tensor becomes a 'Hybrid' class in fbjni naming and holds the member field `HybridData mHybridData;`.

If its construction starts from the Java side, the Java constructors of subclasses call `this.mHybridData = super.initHybrid();` to initialize the cpp part (`at::Tensor tensor_`); we need all the fields initialized, so `mHybridData` is not declared final but works as final.

If construction starts from the cpp side, the cpp side is initialized from the provided at::Tensor with `makeCxxInstance(std::move(tensor))` and passed to the Java method `org.pytorch.Tensor#nativeNewTensor` as the parameter `HybridData hybridData`, which holds a native pointer to the cpp side.

In that case the `initHybrid()` method is not called; instead a parallel set of subclass ctors is used, which stores `hybridData` in `mHybridData`.

Renaming:
`JTensor` -> `TensorHybrid`

Removed method:
`JTensor::newAtTensorFromJTensor(JTensor)` becomes trivial `TensorHybrid->cthis()->tensor()`

Test Plan: Imported from OSS

Differential Revision: D18893320

Pulled By: IvanKobzarev

fbshipit-source-id: df94775d2a010a1ad945b339101c89e2b79e0f83
2019-12-15 21:36:20 -08:00
60ec53c7fd Fix copy kernel speed regression introduced in #29631 (#31279)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31271

This fixes copy kernel speed regression introduced in https://github.com/pytorch/pytorch/issues/29631.

The previous implementation forces the compiler to instantiate `static_cast_with_inter_type` because it is passed as an argument of a function. This behavior makes it impossible for compilers to do optimizations like automatic vectorization, and the function call itself is expensive compared to a single cast instruction.

To check the change, run
```
readelf -Ws /home/xgao/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep static_cast_with_inter_type
```

On nightly build, we have output
```
168217: 0000000001852bf0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsdE5applyEd
168816: 0000000001852d30    33 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEaE5applyEa
168843: 00000000018531f0     7 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIblE5applyEl
168930: 0000000001852c20     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIslE5applyEl
168935: 00000000018528d0   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_4HalfEE5applyES1_
169023: 0000000001852f30    17 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEhE5applyEh
169713: 00000000018525c0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIahE5applyEh
170033: 0000000001852c10     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsiE5applyEi
170105: 0000000001852bd0     5 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIshE5applyEh
170980: 0000000001852fc0    27 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdES1_IfEE5applyES3_
171398: 0000000001852810    13 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdbE5applyEb
171574: 00000000018532e0    35 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbNS_8BFloat16EE5applyES1_
171734: 0000000001852b20     6 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlSt7complexIdEE5applyES2_
172422: 0000000001853350    54 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EaE5applyEa
172704: 00000000018533c0    38 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_8BFloat16EfE5applyEf
172976: 0000000001852890    10 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIflE5applyEl
173038: 0000000001852f80     9 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIdEfE5applyEf
173329: 00000000018531c0    20 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIbfE5applyEf
173779: 00000000018524d0     3 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIhiE5applyEi
174032: 0000000001852960    14 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIfNS_8BFloat16EE5applyES1_
174334: 0000000001852d60    29 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeISt7complexIfEdE5applyEd
174470: 0000000001852c60   124 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIsNS_4HalfEE5applyES1_
174770: 0000000001852bc0    15 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIlNS_8BFloat16EE5applyES1_
176408: 0000000001853980   144 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeINS_4HalfEbE5applyEb
176475: 0000000001852790   128 FUNC    LOCAL  DEFAULT    9 _ZN3c1027static_cast_with_inter_typeIdNS_4HalfEE5applyES1_
....
```

And after this PR, we get empty output
```
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31279

Differential Revision: D19075587

Pulled By: ngimel

fbshipit-source-id: c20088241f39fa40c1d055f0a46eb5b9ece52e71
2019-12-15 14:01:31 -08:00
9dc3d8738c fix view call on discontiguous tensor in to_sparse_backward (#31223)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30820
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31223

Differential Revision: D19044172

Pulled By: ngimel

fbshipit-source-id: ac9fa71197d4f6c5b90a26e8d23360250745a2e2
2019-12-15 11:51:47 -08:00
0e50c1b0d9 Replace assert with cuda assert macro (#31297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31297

Follow-up to https://github.com/pytorch/pytorch/pull/31276

This is the final replacement needed for ATen out-of-place hipification.

Test Plan: wait for CI to clear.

Reviewed By: bddppq

Differential Revision: D19070209

fbshipit-source-id: 1428cd0ddfb5a8f4e234fabce822285e898047ea
2019-12-15 05:43:00 -08:00
ec92711aac Fix error message in incorrect rref.localValue() call (#31199)
Summary:
Closes https://github.com/pytorch/pytorch/issues/31198, see the issue for more details. We throw an error when `local_value()` is called on a non-owned rref, but the incorrect node name is printed in the error message. This PR fixes that and adds a relevant unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31199

Differential Revision: D19072014

Pulled By: rohan-varma

fbshipit-source-id: 760c20bfd2fbf286eaaca19500469509a575cfec
2019-12-14 22:51:00 -08:00
ffe0c1ae4d Make test_torch.py pass cuda-memcheck (#29243)
Summary:
Make the following changes:
- When there are more than 10k errors, cuda-memcheck only shows the first 10k; in that case we shouldn't raise an exception.
- Add an UNDER_CUDA_MEMCHECK environment variable to allow disabling `pin_memory` tests when running cuda-memcheck.
- Add a `--ci` command option; when turned on, the script writes its output to stdout instead of a file and exits with an error if cuda-memcheck fails.
- Add a `--nohang` command option; when turned on, a hang is treated as a pass instead of an error.
- Do simple filtering of the tests to run: skip a test if `'cpu'` is in its name but `'cuda'` is not.
- Add `--split` and `--rank` to allow splitting the work (the NVIDIA CI has a 3-hour limit, so we have to split the work to fit it).
- The error summary could be `ERROR SUMMARY: 1 error` or `ERROR SUMMARY: 2 errors`; the tail is `error` or `errors`, so it is not of fixed length. The script is fixed to handle both (see the sketch after this list).
- Ignore errors from `cufft`
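A minimal sketch of the variable-length-tail fix; the pattern and helper are illustrative, not the script's exact code:

```
import re

SUMMARY = re.compile(r"ERROR SUMMARY: (\d+) errors?")

def num_errors(report):
    match = SUMMARY.search(report)
    return int(match.group(1)) if match else 0

assert num_errors("ERROR SUMMARY: 1 error") == 1
assert num_errors("ERROR SUMMARY: 2 errors") == 2
```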
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29243

Differential Revision: D18941701

Pulled By: mruberry

fbshipit-source-id: 2048428f32b66ef50c67444c03ce4dd9491179d2
2019-12-14 20:29:58 -08:00
701e05dcbb Buck test targets robolectric,instrumentattion
Summary:
Buck targets for robolectric and instrumentation tests for pytorch android:
```
buck test fbsource//fbandroid/mode/server //xplat/caffe2/android:test_host
```
```
buck test //xplat/caffe2/android:test_instrumentation
```
For both:
```
buck test fbsource//fbandroid/mode/server //xplat/caffe2/android:pytorch
```

Models in assets:
`pt_android_test_asset` - creates a buck target that can be included in both robolectric and instrumentation tests; it contains an asset created from the provided TorchScript sources as a separate file, using the latest libtorch binaries.

`pt_gen_test_asset_bin` does the tracing; usage format:
```
generate_test_asset input_file.jit output_file.py
```

Example of test-host setup for users of pytorch android:
robolectric tests:

```
load("fbsource//xplat/caffe2:pt_defs.bzl", "pt_android_test_asset", "pt_predictor_binary", "PT_ANDRIOID_TEST_HOST_JNI_DEPS")

pt_android_test_asset(
    name = "test_asset",
    src = "test_asset.jit",
    asset_name = "test_asset.pt",
)

robolectric3_test(
    name = "example_test_host",
    srcs = [...],
    jni_deps = PT_ANDRIOID_TEST_HOST_JNI_DEPS,
    deps = [
        ":pytorch_common",
        ":test_asset",
        "//fbandroid/java/com/facebook/soloader/annotation:annotation",
        "//fbandroid/java/com/facebook/testing/robolectric/v3:v3",
        "//fbandroid/libraries/soloader/java/com/facebook/soloader:soloader",
        "//fbandroid/third-party/java/robolectric3/robolectric:robolectric",
    ],
)
```

COMMON_LINKER_FLAGS = ["-Wl,--no-as-needed"] cannot be applied on macOS

Test Plan:
```
[twsvcscm@od0187.atn1 /data/sandcastle/boxes/fbsource (b416b20a)]$ buck test fbsource//fbandroid/mode/server //xplat/caffe2/android:pytorch
Parsing buck files: finished in 7.2 sec
Creating action graph: finished in 0.7 sec
Building: finished in 11.9 sec (100%) 791/791 jobs, 0 updated
  Total time: 19.9 sec
Testing: finished in 11.0 sec (30 PASS/0 FAIL)
RESULTS FOR //xplat/caffe2/android:test_host //xplat/caffe2/android:test_instrumentation
PASS     159ms 15 Passed   0 Skipped   0 Failed   org.pytorch.PytorchHostTests
PASS     152ms 15 Passed   0 Skipped   0 Failed   org.pytorch.PytorchInstrumentedTests (localhost:31930)
TESTS PASSED
```

OSS changes test:
```
gradle -p android pytorch_android:cAT passes
```

Reviewed By: dreiss

Differential Revision: D18799005

fbshipit-source-id: 881609826a837efebc8526aee40355c5a62947d0
2019-12-14 20:29:52 -08:00
57ee7dab87 Wraps assert statements in cuda kernels (#31276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31276

Change assert --> CUDA_ASSERT_KERNEL to avoid hip undefined __assert_fail()

This is similar to https://github.com/pytorch/pytorch/pull/13902 in caffe2 land.

Test Plan: wait for CI to clear

Reviewed By: bddppq

Differential Revision: D19047582

fbshipit-source-id: 34703b03786c8eee9c78d2459eb54bde8dc21a57
2019-12-14 20:29:47 -08:00
58eb15f41c JIT Type parser for mobile (#30391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30391

A type parser to parse the Python string form of a Type, for example
"Tuple[str, Optional[float], Dict[str, List[Tensor]], int]".
Please refer to test_type_parser.cpp for usage.

One of the use cases is the lite interpreter, where types need to be serialized (by directly calling python_str() on the Type) and deserialized (by calling parseType(str)).

Test Plan: Imported from OSS

Differential Revision: D18924268

Pulled By: iseeyuan

fbshipit-source-id: 830d411563abfbeec023f01e7f8f4a1796f9a59a
2019-12-14 20:29:42 -08:00
065685180d Loading module from android asset (#30378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30378

Loading module directly from android assets. Iteration on https://github.com/pytorch/pytorch/pull/30109
Loading Module:
```
mModule = AndroidUtils.loadModuleFromAsset(assetName, getAssets());
```

`org.pytorch.AndroidUtils` is excluded from pytorch_jni host build

Testing:
test_app module load switched to this approach and works fine
```
gradle test_app:installMobNet2QuantDebug -PABI_FILTERS=x86 && adb shell am start -n org.pytorch.testapp.mobNet2Quant/org.pytorch.testapp.MainActivity
```

Test Plan: Imported from OSS

Differential Revision: D18893269

Pulled By: IvanKobzarev

fbshipit-source-id: a7c73776f40e9c67bef233da05db56cc6efbe76a
2019-12-14 20:29:37 -08:00
70013415c7 DDP should not set grad for globally unused params (#28883)
Summary:
https://github.com/pytorch/pytorch/issues/28294 DDP should not set grad for globally unused parameters

DDP currently computes the param-to-bucket mapping upfront and allreduces grads for all params in every iteration; even if params are unused, it just sets their grads to zero. With such behavior, the optimizer cannot tell whether a param truly has a zero grad or was simply not used in the current iteration. This can cause convergence problems for optimizers with weight decay and momentum, such as SGD. However, DDP cannot simply set grad to None for locally unused parameters, as they might be used in other processes, so we still need to allreduce their grads. Instead, DDP should figure out the globally unused parameters and skip touching their grads at the end of backward.

Implementation summary:
* Add a locally-used parameter map for each model replica.
* Mark the locally unused parameters at the end of forward, then reduce the maps to get the globally unused parameters.
* At the end of backward, skip touching grads for those globally unused parameters (see the sketch after this list).
* Add a unit test, test_global_local_unused_params_grad.
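A conceptual Python sketch of that flow; the real logic lives in the C++ reducer, these helpers are invented for illustration, and a process group is assumed to be initialized:

```
import torch
import torch.distributed as dist

def globally_used_map(params, locally_used_indices):
    used = torch.zeros(len(params))
    used[list(locally_used_indices)] = 1.0       # marked at the end of forward
    dist.all_reduce(used, op=dist.ReduceOp.SUM)  # per-param use counts across ranks
    return used

def finalize_grads(params, used):
    for p, count in zip(params, used.tolist()):
        if count == 0:
            continue                      # globally unused: leave p.grad untouched
        if p.grad is None:
            p.grad = torch.zeros_like(p)  # locally unused but used on other ranks
```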
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28883

Differential Revision: D18491530

Pulled By: mrshenli

fbshipit-source-id: 24e9b5f20df86c34ddbf9c7106250fd6ce186699
2019-12-14 20:29:32 -08:00
7cb83bea3b Fix static cuda builds on older cmake versions (#30935)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/28378#issuecomment-562597033

To reproduce the failure I had to downgrade to `cmake 3.9` (Ubuntu 18 uses 3.10 apparently). These older `cmake` versions unfortunately don't seem to allow `target_link_libraries(INTERFACE)` to be used with imported libraries. Switching back to `set_property(TARGET)` fixes the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30935

Differential Revision: D18956912

Pulled By: albanD

fbshipit-source-id: a2b728ee3268599a428b7878c988e1edef5d9dda
2019-12-14 20:29:27 -08:00
7c1b5084a7 Enable equality operator for bfloat16 CPU scalar types. (#30817)
Summary:
See https://github.com/pytorch/xla/issues/1330 for reference.
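What this enables on CPU (a minimal check, not the referenced XLA test):

```
import torch

a = torch.tensor([1.0, 2.0], dtype=torch.bfloat16)
b = torch.tensor([1.0, 3.0], dtype=torch.bfloat16)
print(a == b)  # tensor([ True, False])
```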

mruberry ailzhang FYI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30817

Differential Revision: D18847375

Pulled By: mruberry

fbshipit-source-id: d1efedf8b975b8d9b55cf0ddf141818eaa7c91f0
2019-12-14 20:29:21 -08:00
2950530031 caffe2::TypeMeta uses compile time type names (#26619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26619

ghstack-source-id: 95348564

Test Plan: unit tests

Differential Revision: D17519252

fbshipit-source-id: 337ec76d17172dd1af60a1676d69964a41dcb7a1
2019-12-14 20:29:16 -08:00
6e1e09fd10 Compile time type names (#26618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26618

Implement a mechanism to get type names at compile time
In a future diff, I'm planning to introduce this to caffe2::TypeMeta and a few other places.
ghstack-source-id: 95337871

Test Plan: unit tests

Differential Revision: D17519253

fbshipit-source-id: e14017f962fd181d147accb3f53fa8d6ee42a3f8
2019-12-14 20:29:11 -08:00
c35cddb306 Switch default memory format of clone operator to Preserve
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30089
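A hedged illustration of the new default: `clone` now preserves the source tensor's memory format instead of always producing a contiguous copy.

```
import torch

x = torch.randn(1, 3, 8, 8).to(memory_format=torch.channels_last)
y = x.clone()
assert y.is_contiguous(memory_format=torch.channels_last)
```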

Test Plan: Imported from OSS

Differential Revision: D18624985

Pulled By: VitalyFedyunin

fbshipit-source-id: 8d315b08b7b5858fd0a81d3375b44ccb94787ad4
2019-12-14 20:29:06 -08:00
fde3d707ad Switch default memory format of to (and similar) operators to Preserve
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30088

Test Plan: Imported from OSS

Differential Revision: D18624984

Pulled By: VitalyFedyunin

fbshipit-source-id: 54901786d7496c7dce785140b0585ac9093b1d86
2019-12-14 20:29:01 -08:00
927588df8e Switch default memory format of _like operators to Preserve
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30087

Test Plan: Imported from OSS

Differential Revision: D18624986

Pulled By: VitalyFedyunin

fbshipit-source-id: 8e434966f872ffaddf1249248ea445cbbab300ce
2019-12-14 20:28:57 -08:00
1ec989404c Kill some unnecessary function declarations.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31216

Test Plan: Imported from OSS

Differential Revision: D18986640

Pulled By: gchanan

fbshipit-source-id: 30630d9ea025bb510f85e9627cbb4ba46de5e93d
2019-12-14 20:28:52 -08:00
d7d07e7caf thrust is included in SortingKthValue.cu but never used
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31263

Differential Revision: D19042793

Pulled By: ngimel

fbshipit-source-id: 28f06c46a53e15f106ebee6c36e2ad25a3676bd2
2019-12-14 20:28:47 -08:00
cd3f05b44d Small fixes for hipification (#31200)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31200

We do not hipify these files when doing out-of-place hipification.

Test Plan: wait for CI to clear.

Differential Revision: D18963683

fbshipit-source-id: eeba8597143f26417d0a8181a4c746139afefa24
2019-12-14 20:28:43 -08:00
9954739956 Refactor test for unique and unique_consecutive and fix some bugs (#31211)
Summary:
Tests for unique_dim will be refactored in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31211

Differential Revision: D19034968

Pulled By: ngimel

fbshipit-source-id: 855d326b37638b5944f11fbbce03394cf000daf9
2019-12-14 20:28:38 -08:00
3587f769dc use propagate_names instead of propagate_names_for_reduction for cumsum and cumprod
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31134

Differential Revision: D18964172

Pulled By: anjali411

fbshipit-source-id: 3050c6d283a469a858378c44ac2ab9102baefce5
2019-12-14 20:28:33 -08:00
a9ad98fb25 Remove unused argument "destId" in addSendRpcBackward (#31207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31207

Cleanup after #30914.

In #30914, `autogradContext->addKnownWorkerId(dst);` was moved out of `addSendRpcBackward()`.

So `addSendRpcBackward()` no longer needs `dstId` as its argument.
ghstack-source-id: 95509218

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork -- test_context_cleanup_tensor_no_grad
```

Differential Revision: D5742365

fbshipit-source-id: accd041a594ec18d369231f5590289828d87baa7
2019-12-14 20:28:29 -08:00
8fea7a49d6 pinning hypothesis for windows
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31169

Differential Revision: D19036734

Pulled By: mingbowan

fbshipit-source-id: 2205a40720329cb53e741c9827c9049142759588
2019-12-14 20:28:24 -08:00
b64baa963f Robustify rpc_agent handlers with generic Future<T> (#31224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31224

If a future coming back to a rpc_agent server is satisfied with an
exception, ensure this information is propagated back over the wire.
ghstack-source-id: 95522418

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/...

Differential Revision: D18979185

fbshipit-source-id: 99848ae805cc2d48948809a238f61a2e0ef234c9
2019-12-14 20:28:20 -08:00
36d17f4105 abort nccl communicators before throwing operation timed out (#31128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31128

When an operation times out due to errors that are not detected by the nccl communicators, ncclCommWatchdog cannot catch this timeout error and thus cannot abort the ncclComms accordingly. So we explicitly abort the ncclComms here before throwing the timeout exception to users; after this, ncclCommWatchdog can detect that the nccl communicators were aborted and clean up devNCCLCommMap_ accordingly. If the timeout exception were thrown without aborting the nccl communicators here, it was observed that the CUDA GPU would sit at 100% utilization and could not run new events successfully.
ghstack-source-id: 95528488

Test Plan: the newly revised test _test_nccl_errors_blocking passed with the changes in this diff; the revised test failed without the changes in this diff

Reviewed By: isunjin

Differential Revision: D18928607

fbshipit-source-id: be65a05ce4ff005f0c7fed36ae8e28903e8ffe2b
2019-12-13 00:33:36 -08:00
1ef99cf0ab Intrusive_ptr implementation slower than shared_ptr (#30810)
Summary:
It was a random coding exercise so I wasn't putting much effort into it; but, I was like "hey is the current intrusive_ptr implementation optimized enough?" so I compared it with shared_ptr (using std::shared_from_this).

My benchmark results show that intrusive_ptr is actually slower. On my MacBook the numbers are:

```
---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
BM_IntrusivePtrCtorDtor         14 ns         14 ns   52541902
BM_SharedPtrCtorDtor            10 ns         10 ns   71898849
BM_IntrusivePtrArray         14285 ns      14112 ns      49775
BM_SharedPtrArray            13821 ns      13384 ns      51602
```

Wanted to share the results so someone could probably take a look if interested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30810

Reviewed By: yinghai

Differential Revision: D18828785

Pulled By: bddppq

fbshipit-source-id: 202e9849c9d8a3da17edbe568572a74bb70cb6c5
2019-12-13 00:25:36 -08:00
f7c92f60ba Typo in filename align with classname
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31235

Test Plan: Imported from OSS

Differential Revision: D19001793

Pulled By: IvanKobzarev

fbshipit-source-id: ae7f410be6b3c291f1feb3027b5b4a6b7ce15ab3
2019-12-12 23:16:29 -08:00
db90a5b992 Switch to open sourced fbjni (#30175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30175

fbjni was open-sourced and its Java part is published as 'com.facebook.fbjni:fbjni-java-only:0.0.3', so we are switching to it.
We still need the fbjni submodule inside the repo (already pointing to https://github.com/facebookincubator/fbjni) for .so linking.

**Packaging changes**:
Before this change, `libfbjni.so` came from the pytorch_android_fbjni dependency; since we also linked fbjni in `pytorch_android/CMakeLists.txt`, it was also built in pytorch_android but excluded from publishing. Because we had two copies of libfbjni.so, there was a hack to exclude one for publishing and resolve the duplication locally.
```
        if (rootProject.isPublishing()) {
            exclude '**/libfbjni.so'
        } else {
            pickFirst '**/libfbjni.so'
        }
```

After this change, fbjni.so is packaged inside the pytorch_android.aar artifact and we no longer need this gradle logic.

I will update the README in a separate PR, after landing the pending README PR (https://github.com/pytorch/pytorch/pull/30128), to avoid conflicts.

Test Plan: Imported from OSS

Differential Revision: D18982235

Pulled By: IvanKobzarev

fbshipit-source-id: 5097df2557858e623fa480625819a24a7e8ad840
2019-12-12 20:05:22 -08:00
199e1fb348 Use AVX2 to increase frequency for FP16<->FP32 Caffe2 ops (#31203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31203

For multi-instance environment, AVX2 should help increase the clock frequency.
ghstack-source-id: 95502576

Test Plan: buck test //caffe2/caffe2:caffe2_test_cpu -- "Float16"

Reviewed By: jspark1105

Differential Revision: D18962649

fbshipit-source-id: 6532d929a99f41f2f6ad1a1a1962e38ae3ddaecb
2019-12-12 19:42:29 -08:00
ca8cb3241a Expose setNumThreads to android api (#31205)
Summary:
PR https://github.com/pytorch/pytorch/pull/31033 was unlanded due to a macOS build failure:
https://app.circleci.com/jobs/github/pytorch/pytorch/3916388

In this PR, `setNumThreads` is Android-only and has been moved to the separate class `org.pytorch.PytorchAndroid` as a static function, which is better since it has a global effect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31205

Reviewed By: dreiss

Differential Revision: D18977250

Pulled By: IvanKobzarev

fbshipit-source-id: 4995859808af498c82933c4db52bd7c7dfae90e5
2019-12-12 18:57:27 -08:00
b7c148013f fix torch square_ benchmark runtime error (#31221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31221

This fixes the runtime error introduced in https://github.com/pytorch/pytorch/pull/30719, which added the torch square_ operator to the benchmark suite.

Test Plan:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: square_
# Mode: Eager
# Name: square__M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 66.291
```

Reviewed By: hl475

Differential Revision: D18987889

fbshipit-source-id: 09c56e3a73aab5ab661aac2b06429063b3a82fac
2019-12-12 18:48:02 -08:00
f30b14dead Fix handling of type comments in body (#30590)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30477. Any type comment after `# type: (...) -> ` is ignored.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30590

Differential Revision: D18887351

Pulled By: driazati

fbshipit-source-id: 162c652f6d7610d14609bbcb25aaa27cdd947a76
2019-12-12 18:19:30 -08:00
20a2e526ef build a generic future<T> (#29579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29579

Per #28923, this diff moves Future<Message> to torch::utils and extends it to be Future<T>; most of the implementation is copied from FutureMessage and ivalue::Future. Merging ivalue::Future with Future<T> will be done separately.

The main difference between Future<T> and FutureMessage is the error handling: instead of checking the message type inside the Future to handle errors, Future<T> owns has_error_ and error_ states.

This future also passes the value_, has_error_, and error_ states to callbacks, so callbacks can easily read the future's state.

In the next diff, a TorchScript rpc async API will be created. Before the API returns, it will create an ivalue::Future and pass it to Future<T>'s callback, where the state of the ivalue::Future will be set. In this way, the TorchScript rpc async API can still return an ivalue::Future and call wait() to read its state appropriately afterwards.
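A Python-flavored sketch of that design; the real Future<T> is a C++ template in torch::utils, and the names here are illustrative:

```
import threading

class Future(object):
    def __init__(self):
        self._cv = threading.Condition()
        self._completed = False
        self._value = None
        self._has_error = False
        self._error = None
        self._callbacks = []

    def set_value(self, value):
        self._value = value
        self._complete()

    def set_error(self, error):
        self._has_error, self._error = True, error
        self._complete()

    def wait(self):
        with self._cv:
            self._cv.wait_for(lambda: self._completed)
        if self._has_error:
            raise RuntimeError(self._error)
        return self._value

    def add_callback(self, cb):
        with self._cv:
            if not self._completed:
                self._callbacks.append(cb)
                return
        # Callbacks receive value_, has_error_, and error_ so they can read
        # the future's state directly, as described above.
        cb(self._value, self._has_error, self._error)

    def _complete(self):
        with self._cv:
            self._completed = True
            callbacks, self._callbacks = self._callbacks, []
            self._cv.notify_all()
        for cb in callbacks:
            cb(self._value, self._has_error, self._error)
```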
ghstack-source-id: 95479525

Test Plan: unit tests

Differential Revision: D18263023

fbshipit-source-id: 48a65712656a72c2feb0bb3ec8b308c0528986a6
2019-12-12 16:57:14 -08:00
c08f2ea254 Updating submodules
Summary:
GitHub commits:

367861fec0
22f5444c09
11c103407d
34507cb383
16d5e3e5ac
c4ce8e637f
0f7ef79620
330fa43933

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 2b6847af7ccba6b53a866e3fded2edf9995b0aaf
2019-12-12 16:53:44 -08:00
5ef0d6f854 Remove subgraphNode kind assert in unmergeSubgraph (#31212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31212

To be able to use this function more broadly.

Test Plan: unit tests

Reviewed By: jackm321

Differential Revision: D18978913

fbshipit-source-id: d998dc7c7f9540f491a8a4bc5d6d25d9c3bf8764
2019-12-12 15:59:55 -08:00
a2463cbc38 Adding quantized clamp kernel (#30541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30541

ghstack-source-id: 95450749

Adding quantized clamp kernel
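A hedged usage example for the new kernel; the scale and zero point are chosen for illustration:

```
import torch

x = torch.randn(4)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)
print(torch.clamp(qx, -0.2, 0.2).dequantize())
```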

Test Plan:
Added test.

buck test mode/dev //caffe2/test:quantized -- 'test_qclamp \(test_quantized\.TestQuantizedOps\)' --print-passing-details

Differential Revision: D18739628

fbshipit-source-id: 38a029ab96c5b0689bb15c67dc4f274883e74975
2019-12-12 15:54:40 -08:00
1d5af9599d Update ONNX Flatten to accept negative indices in opset 11 (#30751)
Summary:
Update ONNX Flatten to accept negative indices in opset 11.
With this change, some cases of flatten do not rely on the input rank being available.
Fixes: https://github.com/pytorch/pytorch/issues/30512.
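A hedged example of a case this unblocks: exporting flatten with a negative start dim at opset 11:

```
import io
import torch

class Flattener(torch.nn.Module):
    def forward(self, x):
        return torch.flatten(x, start_dim=-2)

torch.onnx.export(Flattener(), torch.randn(2, 3, 4, 5), io.BytesIO(),
                  opset_version=11)
```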
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30751

Reviewed By: hl475

Differential Revision: D18946904

Pulled By: houseroad

fbshipit-source-id: a6fa30a9182fff92211e505a19325525c6112f19
2019-12-12 15:27:54 -08:00
84d6796658 move AWS ECR gc jobs to circleci (#30996)
Summary:
all jobs are currently running with "--dry-run", so you can verify if the jobs are doing the right thing.  i'll remove the flag and make it runs every hour same as on Jenkins once this PR is approved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30996

Differential Revision: D18971001

Pulled By: mingbowan

fbshipit-source-id: 2384bdb50ebdf47aad265395f26be3843f0ce05e
2019-12-12 14:28:20 -08:00
5c936845cf fix torch_train build (#30497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30497

fix torch_train build

Test Plan: buck build //xplat/caffe2:torch_trainAndroid

Reviewed By: dreiss

Differential Revision: D18719662

fbshipit-source-id: a3d06b4068d502dbe29681d9f26906f2b8c7b622
2019-12-12 14:20:17 -08:00
a38184dbab Only create OwnerRRefs when processing remote calls (#31163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31163

The purpose is to unblock integration with TorchScript. Currently,
an OwnerRRef will be created by either a remote call or a to_here
call, whichever arrives first. However, when making RRef an IValue,
we need to know the type of value held by the RRef, which is
retrieved by checking the return type of the TorchScript function.
The TorchScript function is only available during the remote call
but not in the to_here() call. Hence, an OwnerRRef can only be
created when processing a remote call. This commit implements this
behavior by introducing a condition variable for every OwnerRRef
in the RRefContext, and letting the to_here() call and PyRRef::unpickle
block on the CV until the value is ready.

Test Plan: Imported from OSS

Differential Revision: D18949591

Pulled By: mrshenli

fbshipit-source-id: 17513c6f1fd766885ea8e1cd38f672a403fa4222
2019-12-12 14:02:04 -08:00
f6c31f61c5 Enabled roll for bool tensor (#31194)
Summary:
Fixed this [issue](https://github.com/pytorch/pytorch/issues/31079).
Tested via unit test
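What the fix enables (a minimal check, not the test from the PR):

```
import torch

mask = torch.tensor([True, False, False, True])
print(torch.roll(mask, 1))  # tensor([ True,  True, False, False])
```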
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31194

Differential Revision: D18958141

Pulled By: izdeby

fbshipit-source-id: 119bf4d31df10ee02c277f5a4663038470cf7780
2019-12-12 13:48:14 -08:00
bee6344d4e remove / rewrite weak module tests (#31193)
Summary:
Remove most of the testing for `weak_script`, since we removed it. Refactor a few of the existing tests to use recursive scripting api.

Fix for https://github.com/pytorch/pytorch/issues/23965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31193

Differential Revision: D18966291

Pulled By: eellison

fbshipit-source-id: 6b1e18c293f55017868a14610d87b69be42bde12
2019-12-12 13:33:38 -08:00
066e3ed953 Re-apply "[bert/RoBERTa] Optimize LayerNorm with explicit vectorization using Vec256" (#31127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31127

Original commit changeset: d22448b90843

On Skylake T6:

Single Core:
(Note that our benchmark generates batch_size=47 for the first case and batch_size=56 for the second case. In spite of that, the vectorized version is still faster than the original reference C version without vectorization.)
- Before the PR:
```
native_layer_norm        0.81%            5.884ms          0.81%            5.884ms          122.580us        NaN              0.000us          0.000us          48               [[47, 1, 1024], [1024], [1024]]
```

- After the PR:
```
native_layer_norm        0.68%            5.053ms          0.68%            5.053ms          105.272us        NaN              0.000us          0.000us          48               [[56, 1, 1024], [1024], [1024]]
```

20 Cores:
- Before the PR:
```
native_layer_norm        1.65%            41.682ms         1.65%            41.682ms         868.365us        NaN              0.000us          0.000us          48               [[61, 64, 1024], [1024], [1024]]
```

- After the PR:
```
native_layer_norm        1.34%            33.829ms         1.34%            33.829ms         704.771us        NaN              0.000us          0.000us          48               [[61, 64, 1024], [1024], [1024]]
```
ghstack-source-id: 95420889

Test Plan:
buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

buck test mode/dev-nosan //caffe2/test:nn -- "test_LayerNorm_1d_no_elementwise_affine_eval"

 python run_test.py -i nn -- TestNN.test_LayerNorm_1d_no_elementwise_affine_eval

Differential Revision: D18936428

fbshipit-source-id: 8cae33d35fb338b5ac49b1597c2709152612d6e5
2019-12-12 13:31:12 -08:00
66f2bba852 Adding function to convert Module to channels last
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28991

Test Plan: Imported from OSS

Differential Revision: D18430810

Pulled By: VitalyFedyunin

fbshipit-source-id: 0693d4e31fc6f9831722c29fc83517f16ddfc028
2019-12-12 11:38:35 -08:00
4ead2e8996 Fix CircleCI behavior for non-leaf stack PRs (#31088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31088

Original issue:
https://github.com/pytorch/pytorch/issues/31027

The problem is that, for non-leaf PRs in a stack of PRs, CircleCI does not set the environment variable `CIRCLE_PULL_REQUEST`, which is used to filter out jobs that should run only on `master`.

(The Android job for master includes all 4 ABIs (x86, x86_64, armeabi-v7a, arm64-v8a), and the gradle build tries to collect results from all 4 ABIs; for PRs we run only the x86 build to save resources. That's why an unfiltered master-style Android job fails on a PR: the ABIs other than x86 were never scheduled.)

The env variable `CIRCLE_BRANCH` is set correctly and can be used as a workaround to detect that this is a PR (published with ghstack).

Test Plan: Imported from OSS

Differential Revision: D18966385

Pulled By: IvanKobzarev

fbshipit-source-id: 644c5ef07fcf2d718b72695da2cc303da8b94ef4
2019-12-12 11:33:14 -08:00
bcb0bb7e0e Remove unnecessary ATen/core/EnableNamedTensor.h (#31117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31117

After this diff, we will have completely removed the named tensor
feature flagging. This means that named tensors are always on and that
there is no mechanism to turn them off. There should be no more follow-up
diffs.

I performed the deletion of the header with
```
find . -type f -print0 | xargs -0 sed -i '/#include
<ATen\/core\/EnableNamedTensor.h>/d'
```

Test Plan: - wait for CI

Differential Revision: D18934952

Pulled By: zou3519

fbshipit-source-id: 253d059074b910fef15bdf885ebf71e0edf5bea5
2019-12-12 09:53:07 -08:00
9047d4df45 Remove all remaining usages of BUILD_NAMEDTENSOR (#31116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31116

Changelist:
- remove BUILD_NAMEDTENSOR macro
- remove torch._C._BUILD_NAMEDTENSOR
- remove all python behavior that relies on torch._C._BUILD_NAMEDTENSOR

Future:
- In the next diff, I will remove all usages of
ATen/core/EnableNamedTensor.h since that header doesn't do anything
anymore
- After that, we'll be done with the BUILD_NAMEDTENSOR removal.

Test Plan: - run CI

Differential Revision: D18934951

Pulled By: zou3519

fbshipit-source-id: 0a0df0f1f0470d0a01c495579333a2835aac9f5d
2019-12-12 09:53:03 -08:00
c0bcfd0445 Revert D18923167: Expose setNumThreads to android api
Test Plan: revert-hammer

Differential Revision:
D18923167

Original commit changeset: 8d98c2edbff4

fbshipit-source-id: 7db37cff298c511d0dd9eb373811c769e4a73be9
2019-12-12 09:23:58 -08:00
56de8853da Resubmit overload v2 (#31123)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/30356 and https://github.com/pytorch/pytorch/pull/31014 :'(

The last commit contains the fix. There was an internal fbcode error: the compiler could not compile the previous `impl_default->second.equal(default_val.second))` line. I tried various fixes in C++ internally but couldn't figure anything out. This is a good example of the programming cost of going from Python to C++ for different types of objects, because the conceptual overhead has expanded in scope from (python) to (python, c++, pybind).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31123

Differential Revision: D18936128

Pulled By: eellison

fbshipit-source-id: 7d8fd66a6dd4a3e9838f3a0b68c219b6565a9462
2019-12-12 07:54:23 -08:00
3a02ed822b Remove insert_prepack_unpack and fold_prepack for now (#30909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30909

`fold_prepack` no longer works after we changed `scale` and `zero_point`
to be attributes, but since the freeze API is coming up, I don't want to
spend time making this work, as it will be thrown away later.

Test Plan:
.

Imported from OSS

Differential Revision: D18864537

fbshipit-source-id: 649e6b91f2b04b8babacc0afb6bc1530ed7259d3
2019-12-12 07:44:31 -08:00
159835e666 Add types for the remaining optimizers. (#31130)
Summary:
**Patch Description**
Round out the remaining optimizer types in torch.optim by creating stubs for them.

**Testing**:
I ran mypy looking for just errors in that optim folder. There's no *new* mypy errors created.
```
$ mypy torch/optim | grep optim
$ git checkout master; mypy torch/optim | wc -l
968
$ git checkout typeoptims; mypy torch/optim | wc -l
968
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31130

Reviewed By: stephenroller

Differential Revision: D18947145

Pulled By: vincentqb

fbshipit-source-id: 5b8582223833b1d9123d829acc1ed8243df87561
2019-12-12 06:36:41 -08:00
2488231fe3 Tweak pollTimedOutRPCs thread synchronization (#30355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30355

- Make processTimedOutFutures hold the lock.
- Reduce unnecessary scans over the future and future-timeout maps.
- Reduce the scope of a lock at one spot.
- Avoid repeated wake-ups if the user sets timeout = 0.

ghstack-source-id: 95409528

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rpc_timeouts
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_rpc_timeouts
```

Differential Revision: D5516149

fbshipit-source-id: 4bb0bd59fa31d9bfaef9f07ac0126782da17f762
2019-12-11 22:02:32 -08:00
0db6c01301 Re-enable python 2 builds (#31164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31164

We have a small number of internal projects that are still on Python 2.
Until we can figure out how to get rid of them, we need to continue
supporting Python 2 for PyTorch.

Test Plan: Imported from OSS

Differential Revision: D18949698

Pulled By: suo

fbshipit-source-id: 4a9d7e4306ed81576e05f243de472937a2bb1176
2019-12-11 22:02:28 -08:00
4f5a4be45f Add native/quantized to the list of header rewrites (#31151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31151

same as title. I am not sure why this was not added in the first place.

Test Plan: wait for build to succeed.

Reviewed By: bddppq, xw285cornell

Differential Revision: D18880216

fbshipit-source-id: 8b17d4fbd5dd08c28c52df8b1da77b69d56d65dc
2019-12-11 21:59:29 -08:00
6ab2d1b1a4 Partially support tensor lists in loop/concat/stack (#30126)
Summary:
This is a follow-up PR after https://github.com/pytorch/pytorch/pull/29136 ~~and https://github.com/pytorch/pytorch/pull/29171~~

ONNX::Loop does not support Sequence type as loop-carried dependencies. Only tensors are supported.
This PR adds a pass that converts Sequence loop-carried dependencies to scan_outputs.
In opset 11, only the below pattern is supported.
```
PTIR graph:
 ...
 %res.1 : Tensor[] = prim::ListConstruct()
 %res : Tensor[] = prim::Loop(%11, %22, %res.1)
   block0(%i.1 : Tensor, %res.6 : Tensor[]):
     ...
     %res.3 : Tensor[] = aten::append(%res.6, %17)
     -> (%22, %res.3)
 return (%res.3)

ONNX graph:
 ...
 %res : Tensor = onnx::Loop(%11, %22)
   block0(%i.1 : Tensor):
     ...
     -> (%22, %17)
 %res_seq : Tensor[] = onnx::SplitToSequence[keepdims=0](%res)
 return (%res_seq)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30126

Reviewed By: hl475

Differential Revision: D18946880

Pulled By: houseroad

fbshipit-source-id: 67ee65700513e8a942344a3d647e2e73c19ee3d2
2019-12-11 21:24:41 -08:00
a3ed350eb2 Change type of timeoutFutures_ key to time_point instead of duration (#31078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31078

Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.
ghstack-source-id: 95408482

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rpc_timeouts
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_rpc_timeouts
```

Differential Revision: D5705442

fbshipit-source-id: ba54b7bdb84bc02d05c22360b01290d044bbfcf5
2019-12-11 21:01:31 -08:00
49a5841a9f Make Conv{1,2,3}dOptions and ConvTranspose{1,2,3}dOptions different classes (#31005)
Summary:
Currently, both `Conv{1,2,3}dOptions` and `ConvTranspose{1,2,3}dOptions` are aliases of the `ConvOptions<{1,2,3}>` class, which causes confusion because the `ConvOptions` class has parameters such as `transposed` that shouldn't be exposed to the end user. (This has caused issues such as https://github.com/pytorch/pytorch/issues/30931.) This PR makes the following improvements:
1. Rename the original `torch::nn::ConvOptions<N>` class to `torch::nn::detail::ConvNdOptions<N>` class, to signify that it's an implementation detail and should not be used publicly.
2. Create new classes `torch::nn::ConvOptions<N>` and `torch::nn::ConvTransposeOptions<N>`, which have parameters that exactly match the constructor of `torch.nn.Conv{1,2,3}d` and `torch.nn.ConvTranspose{1,2,3}d` in Python API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31005

Differential Revision: D18898048

Pulled By: yf225

fbshipit-source-id: 7663d646304c8cb004ca7f4aa4e70d3612c7bc75
2019-12-11 20:31:48 -08:00
85107e72b4 Fix type unification With Specialized Tensor Shapes (#31076)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/30015

We had a model that failed in shape propagation because we could not unify `Tensor` and `Optional[BoolTensor]`. Tensor not subtyping Optional[BoolTensor] was correct, but we should have unified those two types to `Optional[Tensor]`.
The fix here is that for immutable type containers (Optional, Tuple), we first attempt to unify with complete shape information, and if that fails, we then try to unify the types with the shape information stripped.
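For intuition, a minimal sketch of two branch types that must unify (plain `Optional[Tensor]` stands in here, since the dtype-specialized `Optional[BoolTensor]` has no Python-level annotation):
```python
import torch
from typing import Optional

@torch.jit.script
def pick(flag: bool, dense: torch.Tensor,
         mask: Optional[torch.Tensor]) -> Optional[torch.Tensor]:
    # one branch yields Tensor, the other Optional[Tensor]; the compiler
    # must unify the two to Optional[Tensor] rather than fail
    if flag:
        return dense
    return mask
```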
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31076

Differential Revision: D18921802

Pulled By: eellison

fbshipit-source-id: aa6890277470c60b349ed1da4d81cc5d71d377f6
2019-12-11 20:11:34 -08:00
97c1e90f46 ONNX Interpolate Add Scales Params (#28324)
Summary:
Fix for : https://github.com/pytorch/pytorch/issues/27176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28324

Reviewed By: hl475

Differential Revision: D18309133

Pulled By: houseroad

fbshipit-source-id: 348bb41393442c6b107d88fc2cd3224e0afa3ccf
2019-12-11 20:09:15 -08:00
79c27ba4ef Add ONNX Export Support to floor_divide (#31081)
Summary:
Adding support for the new ATen op floor_divide which was introduced in https://github.com/pytorch/pytorch/pull/30493/files.

This operation is used in torchvision's FasterRCNN/MaskRCNN models, which began failing after the new op was introduced.
This PR fixes the failure.
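A minimal sketch (hypothetical module, not from the PR) of the kind of model this unblocks; the `//` operator on tensors lowers to `aten::floor_divide` during tracing:
```python
import torch

class FloorDiv(torch.nn.Module):
    def forward(self, boxes, stride):
        return boxes // stride  # traced as aten::floor_divide

torch.onnx.export(FloorDiv(), (torch.arange(8.), torch.tensor(2.)),
                  "floor_div.onnx")
```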

cc: neginraoof
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31081

Reviewed By: houseroad

Differential Revision: D18945316

Pulled By: eellison

fbshipit-source-id: 09919c237d618ce7db293c7770f48f7304949dcf
2019-12-11 19:39:11 -08:00
d81c6bde3b Updating submodules
Summary:
GitHub commits:

36ab9debf5
55e5070f0a
5fed1a6da7
9f0f470fce
e1dfe80fe0
786d2c588c
6c2b9d596d

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 1242688c93ba233f19f3afac174c814ae4c455dc
2019-12-11 18:58:37 -08:00
efe683fb2a Dynamically quantized linear benchmarking
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30148

Test Plan: Imported from OSS

Differential Revision: D18613006

Pulled By: z-a-f

fbshipit-source-id: 3851189a2822fd09a5dd97c9d54774727822d2bf
2019-12-11 18:39:57 -08:00
73f9e81660 Make rref fetch calls async. (#31086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31086

This change leverages the new future response framework so that server
threads don't block waiting for setValue to be called. In particular, we add a
getFuture() method to OwnerRRef so that we get a future that is satisfied
once setValue is called.
ghstack-source-id: 95402273

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D18925272

fbshipit-source-id: 2caf51019e5b5fd7ec45539544780067deb28610
2019-12-11 18:30:09 -08:00
679b20b1e4 Unify list elements for all list types (#30777)
Summary:
Previously list elements were only unified for tensor lists.
This improves error messages and expands the unification logic
to include all types.
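A sketch of what the broader unification allows (assuming current behavior): a list literal mixing `Tensor` and `None` is typed `List[Optional[Tensor]]` instead of failing:
```python
import torch
from typing import List, Optional

@torch.jit.script
def mixed(t: torch.Tensor) -> List[Optional[torch.Tensor]]:
    # Tensor and None unify to Optional[Tensor]
    return [t, None]
```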
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30777

Pulled By: driazati

Differential Revision: D18837726

fbshipit-source-id: c4d275562a8429700987569426d694faa8f6002e
2019-12-11 17:00:52 -08:00
0414463007 doc fix for max method: a warning about different behaviour on CPU and GPU (#31115)
Summary:
Fixes [30708](https://github.com/pytorch/pytorch/issues/30708).
Adds a warning about the method's different behaviour depending on the device type.
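A small illustration of the behaviour being documented (values agree across devices, but ties between equal maxima may be broken differently):
```python
import torch

x = torch.tensor([1., 5., 5., 2.])
values, indices = x.max(dim=0)
# `values` is 5. on every device; with duplicated maxima, `indices` may be
# 1 on one device and 2 on another -- do not rely on which tie wins
```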
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31115

Differential Revision: D18937365

Pulled By: zou3519

fbshipit-source-id: 7c731dd80f8b371de08d7fdfcc2196be15a593e1
2019-12-11 16:02:33 -08:00
e5a550cd1d Fix Test CI by pinning hypothesis and correcting the import (#31137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31137

Our Test CI is broken because:
- hypothesis recently did a new release that reorganized their internal
modules
- we were importing something from their internal module structure.

This PR fixes the CI by doing the following:
- import SearchStrategy from the correct (public) location
- Pin the hypothesis version to avoid future surprises.

In the long term, we should stop installing hypothesis every time the CI
runs and instead install it as a part of our docker build process. See
https://github.com/pytorch/pytorch/issues/31136 for details.

Test Plan:
- I tested this locally; before this PR test/test_nn.py fails to run but
after it does run.
- Wait for CI

Differential Revision: D18940817

Pulled By: zou3519

fbshipit-source-id: c1ef78faa5a33ddf4d923f947c03cf075a590bb8
2019-12-11 15:42:59 -08:00
945ce71b18 Correctly handle scalar types, fix parse of numpy ints (#30486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30486

Fixes: https://github.com/pytorch/pytorch/issues/29252

There was some incorrect code in the parsing of Python numbers that led to issue #29252:

When we allow interpretation of a zero-dim numpy integer value
as a scalar in pytorch, we incorrectly parse the int as a float.

This PR also fixes the issue described in the "FIXME" here:
https://github.com/pytorch/pytorch/pull/27628/files#diff-f539198dd366265fb8dc2d661bc5d5bcR1487
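A hypothetical illustration of why parsing an integer scalar through float is lossy (the exact repro in the issue differs; this only shows the precision hazard):
```python
import numpy as np
import torch

big = np.int64(2**62 + 1)             # not exactly representable as float64
t = torch.zeros(2, dtype=torch.int64)
t[0] = big                            # zero-dim numpy int parsed as a scalar
assert t[0].item() == 2**62 + 1       # fails if the int round-trips via float
```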

Test Plan: Added a unit test based on the example given in the issue.

Differential Revision: D18932520

Pulled By: nairbv

fbshipit-source-id: f6416f28dfd73ac72c1042042851d76beb5fcf65
2019-12-11 15:35:57 -08:00
293a139d79 add a warning for script classes (#31069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31069

Just to clarify that they are still experimental.

Test Plan: Imported from OSS

Differential Revision: D18920496

Pulled By: suo

fbshipit-source-id: d2f3014592a01a21f7fc60a4ce46dd0bfe5e19e9
2019-12-11 14:48:55 -08:00
6225443009 Expose setNumThreads to android api (#31033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31033

Intention:
There are requests from users to control the number of threads from the Android side:
https://discuss.pytorch.org/t/android-pytorch-forward-method-running-in-a-separate-thread-slow-down-ui-thread/63516/2
https://discuss.pytorch.org/t/threading-of-model-pytorch-android/62490/2

At the moment `setNumThreads` is placed in `org.pytorch.Module`, but this method changes the global thread-pool size; in the future we will move it to a separate class to mirror the Python binding structure, which has torch.set_num_threads()
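For reference, the Python-side binding that the Android API is meant to mirror:
```python
import torch

torch.set_num_threads(4)        # resizes the global intra-op thread pool
print(torch.get_num_threads())  # 4
```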

Test Plan: Imported from OSS

Differential Revision: D18923167

Pulled By: IvanKobzarev

fbshipit-source-id: 8d98c2edbff42e9b673509672dce3f2dd03a923e
2019-12-11 14:20:14 -08:00
06d874f95b Change startTime_ to endTime_ in FutureInfo (#30342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30342

This eliminates the unnecessary calls to getRPCEndTime() and reduces lines of code for simplicity.

ghstack-source-id: 95377162

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rpc_timeouts
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_rpc_timeouts
```

Differential Revision: D5705624

fbshipit-source-id: aca4c4917718124022c09ee0d13cf5ca483402af
2019-12-11 14:04:49 -08:00
7a8261e962 Updating submodules
Summary:
GitHub commits:

06033e7eb2
c56d2fa73f
972f299a62
3717a88289
ea64a080c6
b4e0237162

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 73d2d91c851f1905d6d4606a9f8002eb47246852
2019-12-11 12:52:00 -08:00
4b2d356ac1 Re-enable test_rref_context_debug_info after enforcing proper synchronization (#30994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30994

The flakiness we saw was due to missing barriers(), which caused
state to leak into previous or subsequent checks. This commit
attempts to fix the problem by adding barriers before and after each
check.

Test Plan: Imported from OSS

Differential Revision: D18893457

Pulled By: mrshenli

fbshipit-source-id: 42bcc12efa7e6e43e2841ef23e4bc2543b0236c6
2019-12-11 12:38:14 -08:00
5b03ff0a09 Update embedding renorm comment to reference fixed issue (#29140)
Summary:
Address last comment in https://github.com/pytorch/pytorch/issues/28546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29140

Differential Revision: D18915091

Pulled By: albanD

fbshipit-source-id: 756ff5bb6a92d47c80aa9f96ff6f0edea5fd24de
2019-12-11 11:58:55 -08:00
dbc8b00816 Document WorkerInfo and RpcBackendOptions structures in RPC docs. (#31077)
Summary:
We mention `WorkerInfo` and `RpcBackendOptions` in a couple of different locations in our docs, and these are public classes that the user may use, so we should add them to the documentation.
![Screen Shot 2019-12-10 at 1 42 22 PM](https://user-images.githubusercontent.com/8039770/70571759-47db2080-1b53-11ea-9d61-c83985a29dd9.png)
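A minimal usage sketch (assuming RPC has already been initialized on the calling worker):
```python
import torch.distributed.rpc as rpc

info = rpc.get_worker_info()           # WorkerInfo for the current worker
peer = rpc.get_worker_info("worker1")  # or look another worker up by name
print(info.name, info.id)
```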
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31077

Differential Revision: D18928162

Pulled By: rohan-varma

fbshipit-source-id: 67f11eedd87523c469377b791a0ba23704ec3723
2019-12-11 11:39:57 -08:00
4a751dfc20 optimize MulGradient for common shapes (#19705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19705

Optimizing for the case where a run of consecutive dims that are not broadcast is followed by a run of consecutive dims that are broadcast.
For example, MulGradient(["dC", "A", "B"], ["dA", "dB"], broadcast=True, axis=0) where A.shape == dC.shape == [9508, 80] and B.shape == [80].
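In numpy terms, the gradient being specialized looks like the sketch below (shapes from the example above); the win comes from reducing `dB` over the contiguous non-broadcast leading dims in a single pass:
```python
import numpy as np

A = np.random.rand(9508, 80)
B = np.random.rand(80)
dC = np.random.rand(9508, 80)

dA = dC * B                # broadcast multiply, no reduction needed
dB = (dC * A).sum(axis=0)  # reduce over the non-broadcast leading dims
```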

Test Plan:
In SKL T6,

Running mul_gradient_benchmark without this optimization
Operator #0 (dA, MulGradient) 11.9119 ms/iter

After this optimization,
Operator #0 (dA, MulGradient) 0.672759 ms/iter

Need to land D15291800 first to fix the unit test error

Reviewed By: dmudiger

Differential Revision: D15075415

fbshipit-source-id: 0f97be17cf8f1dacbafa34cd637fb8bc1c5e5387
2019-12-11 11:39:52 -08:00
a53b39f09d Disable flaky test_process_group_debug_info
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31113

Test Plan: Imported from OSS

Differential Revision: D18932365

Pulled By: mrshenli

fbshipit-source-id: a2996b6a8d3881be4ffc174b85509aeee8c51c96
2019-12-11 11:36:58 -08:00
44ecc3a70b Add tracing support for optional Device and Layout (#30979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30979

This stack is a first step toward an effort to fix, clean up and simplify code generation logic. Please see the master [task](https://github.com/pytorch/pytorch/issues/30405) to see related discussions and all the known issues.

Main focus of these changes is TensorOptions in code generation.
Goals:
- Remove TensorOptions from generated code wherever it's possible. Leave it only in python/C++ API layers.
- Refactor TensorOptions logic to a single place.
- Log all discovered issues.

Non goals:
- Fix Everything!
- Remove all the hacks in code generation scripts.
- Clean up and refactor all code generation scripts.

--------------
In this PR:
Add tracing support for optional Device and Layout types.

--------------

Test Plan: Imported from OSS

Differential Revision: D18912685

Pulled By: izdeby

fbshipit-source-id: 4a9514ce2eee0041f9bc96636d3ddb4f077675e1
2019-12-11 11:32:52 -08:00
672f4cfad9 Added C++ API test (#30980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30980

This stack is a first step toward an effort to fix, clean up and simplify code generation logic. Please see the master [task](https://github.com/pytorch/pytorch/issues/30405) to see related discussions and all the known issues.

Main focus of these changes is TensorOptions in code generation.
Goals:
- Remove TensorOptions from generated code wherever it's possible. Leave it only in python/C++ API layers.
- Refactor TensorOptions logic to a single place.
- Log all discovered issues.

Non goals:
- Fix Everything!
- Remove all the hacks in code generation scripts.
- Clean up and refactor all code generation scripts.

--------------
In this PR:
Add a test to check that C++ API behavior stays the same after all the changes.
While working on it, a bug related to `requires_grad` was found and logged in the master task.

--------------

Test Plan: Imported from OSS

Differential Revision: D18912681

Pulled By: izdeby

fbshipit-source-id: 19772a37c92dde820839b79055f348689b99fa77
2019-12-11 11:21:05 -08:00
1f87e823b8 Make nn.Transformer TorchScript compatible (#28561)
Summary:
This makes `nn.Transformer` usable from TorchScript. It preserves backwards compatibility via `__setstate__` on the encoder/decoder.

Fixes https://github.com/pytorch/pytorch/issues/24173.
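A quick sketch of what now works:
```python
import torch
import torch.nn as nn

model = torch.jit.script(nn.Transformer(d_model=32, nhead=4))
src = torch.rand(10, 2, 32)  # (seq, batch, feature)
tgt = torch.rand(7, 2, 32)
out = model(src, tgt)        # scripting previously failed to compile
```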
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28561

Differential Revision: D18124753

Pulled By: driazati

fbshipit-source-id: 7314843e5aa9c9bf974c4672e4edb24ed8ef4a6f
2019-12-11 10:57:31 -08:00
a929d312ac Add dill>=0.3.1 as testing dependency (#31121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31121

For https://github.com/pytorch/pytorch/pull/30985.

Test Plan:
- run `pip install "dill>=0.3.1"` locally, check that it actually
installs dill>=0.3.1.

Differential Revision: D18934871

Pulled By: zou3519

fbshipit-source-id: 688a489b9e81134ccb5ab4b099116e3fe6b6b7ae
2019-12-11 10:33:00 -08:00
3593981976 Updating submodules
Summary:
GitHub commits:

9b38c6430e

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 8801c415c9b00bec46efc102c0daceba59397449
2019-12-11 09:50:33 -08:00
717274c001 Add useful warnings for t.grad when it won't be populated for known reasons (#30531)
Summary:
Fix https://github.com/pytorch/pytorch/issues/2362 and https://github.com/pytorch/pytorch/issues/19778

To avoid issues with frozen models, we only warn for Tensors that require gradients and neither are leaves nor retain gradients.
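A short sketch of the behaviour: accessing `.grad` on a non-leaf that does not retain gradients warns, and `retain_grad()` is the opt-in:
```python
import torch

x = torch.ones(2, requires_grad=True)  # leaf: .grad is populated
y = x * 2                              # non-leaf: y.grad stays None (and warns)
y.retain_grad()                        # opt in to populating y.grad
y.sum().backward()
print(x.grad, y.grad)
```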
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30531

Differential Revision: D18832767

Pulled By: albanD

fbshipit-source-id: 743e863dc14ab57713e66da78b2e4d759dfba0ff
2019-12-11 09:47:18 -08:00
3301794855 Port ELU activation to Aten (#29275)
Summary:
VitalyFedyunin, this PR ports the ELU activation to ATen.
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.ELU()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).
CPU:
OMP_NUM_THREADS=56
input size(128, 100) forward time is 0.28 (ms); backward avg time is 0.18 (ms).
input size(128, 10000) forward time is 23.53 (ms); backward avg time is 14.46 (ms).
OMP_NUM_THREADS=1
input size(128, 100) forward time is 0.16 (ms); backward avg time is 0.08 (ms).
input size(128, 10000) forward time is 15.53 (ms); backward avg time is 6.60 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).
CPU:
OMP_NUM_THREADS=56
input size(128, 100) forward time is 0.24 (ms); backward avg time is 0.17 (ms).
input size(128, 10000) forward time is 0.73 (ms); backward avg time is 1.11 (ms).
OMP_NUM_THREADS=1
input size(128, 100) forward time is 0.15 (ms); backward avg time is 0.07 (ms).
input size(128, 10000) forward time is 14.40 (ms); backward avg time is 6.00 (ms).
```
How to set the number of threads? Use the following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run **./run.sh num_threads test.py**.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29275

Differential Revision: D18587389

Pulled By: VitalyFedyunin

fbshipit-source-id: bea8f3f006c6893090f863d047c01886d195437a
2019-12-11 09:44:34 -08:00
4aa30d3c0c Revert D18293522: Optimize LayerNorm with explicit vectorization using Vec256
Test Plan: revert-hammer

Differential Revision:
D18293522

Original commit changeset: f4cfed6e62ba

fbshipit-source-id: cdd6d9d36c00b516aecdab549abeeffc4a473829
2019-12-11 08:55:28 -08:00
9305f44854 Remove BUILD_NAMEDTENSOR from codegen and .cu files (#31047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31047

Changelist:
- remove BUILD_NAMEDTENSOR from .cu files
- remove BUILD_NAMEDTENSOR special handling in function_wrapper.py
- remove BUILD_NAMEDTENSOR from cpp_extension.py. This code actually
did nothing because we always compile with BUILD_NAMEDTENSOR.

Test Plan: - run tests

Differential Revision: D18908442

Pulled By: zou3519

fbshipit-source-id: b239e24de58580adaf3cef573350773a38b1e4f0
2019-12-11 08:49:56 -08:00
65f6e449c7 Updating submodules
Summary:
GitHub commits:

0f94976f31
be15abd839
034086d70f
aa131abdf5
a3f268f1b5
6394aabc99

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: fa99a0a096de1f088e5fa8cd92fdf5fd6c330740
2019-12-11 07:25:34 -08:00
d6d6075573 Optimize LayerNorm with explicit vectorization using Vec256 (#29104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29104

We would like to provide the vectorized implementation for layer norm. This PR reuses https://github.com/pytorch/pytorch/pull/23349.

Test Plan:
buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

buck test mode/dev-nosan //caffe2/test:nn -- "test_LayerNorm_1d_no_elementwise_affine_eval"

 python run_test.py -i nn -- TestNN.test_LayerNorm_1d_no_elementwise_affine_eval

Differential Revision: D18293522

fbshipit-source-id: f4cfed6e62bac1b43ee00c32b495ecc836bd9ec5
2019-12-11 06:01:45 -08:00
28ee309c9a disable onnx py3 gcc5 build (#31100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31100

This appears to not work right now. Disabling pending an investigation.

Test Plan: Imported from OSS

Differential Revision: D18928777

Pulled By: suo

fbshipit-source-id: 63089131bad98902979e5cf4373732c85badef9d
2019-12-11 00:26:15 -08:00
8013ffd400 Fix weight_norm export for dim=0 (#31015)
Summary:
The exported weight_norm incorrectly reduced over axis 0 as well when dim is set to 0.
The previous test case only covered a weight with size(0) == 1, which yields the same result whether or not that axis is reduced.
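A minimal repro sketch of the case the old test missed (size(0) > 1 with dim=0); the export path is assumed to go through tracing:
```python
import torch
import torch.nn as nn

m = torch.nn.utils.weight_norm(nn.Linear(3, 5), dim=0)  # weight size(0) == 5
torch.onnx.export(m, torch.randn(2, 3), "weight_norm.onnx")
```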
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31015

Reviewed By: hl475

Differential Revision: D18900894

Pulled By: houseroad

fbshipit-source-id: 19004f51933b37f848dbe4138e617a7a8e35a9ec
2019-12-10 23:43:56 -08:00
9a5fd2eb07 Fix conflicts in CMAKE_GENERATOR and generator (#30971)
Summary:
...specified in -G

https://cmake.org/cmake/help/latest/variable/CMAKE_GENERATOR.html
According to the documentation, the generator can be determined in two ways:
1. Specify in `-G`
2. Read from `CMAKE_GENERATOR`

We should avoid conflicts between these two methods. This fixes https://github.com/pytorch/pytorch/issues/30910.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30971

Differential Revision: D18927529

Pulled By: mingbowan

fbshipit-source-id: e9a179ceb32d6fbabfaeac6cfe9e6170ca170b20
2019-12-10 22:22:26 -08:00
7f5f2e8871 add ZERO_COLLISION_HASH to caffe2 data type (#30912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30912

Add a new data type ZERO_COLLISION_HASH .

Test Plan: ci

Reviewed By: boryiingsu

Differential Revision: D18843626

fbshipit-source-id: b2d8280f13c78b4a656cf95822198df59de7b64c
2019-12-10 21:36:24 -08:00
c72dd526a7 kill py2 onnx builds
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31082

Differential Revision: D18922689

Pulled By: suo

fbshipit-source-id: 98c91b90ee3b1dd13c6020597a0ace741a1597da
2019-12-10 20:25:42 -08:00
9f3fe78239 peephole optimize type refinements (#31024)
Summary:
Peephole-optimize away type refinements when they no longer refine the type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31024

Differential Revision: D18920958

Pulled By: eellison

fbshipit-source-id: 6d05d9812b9f9dcf001de760a78a2042fb832773
2019-12-10 18:32:28 -08:00
d02280b432 move migration guide to appendix (#31068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31068

Let's get it out of the early parts now that the recursive API has been
around for a while

Test Plan: Imported from OSS

Differential Revision: D18920498

Pulled By: suo

fbshipit-source-id: 6f4389739dd9e7e5f3014811b452249cc21d88e7
2019-12-10 18:04:02 -08:00
d088bd0bad Updating submodules
Summary:
GitHub commits:

c6506e2698
4427c1a832
a653857178
558f42bd6c
3839cbaf52

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 4a253bba6de9a2c2a11a82e33809a370e1b4fd04
2019-12-10 16:58:08 -08:00
e7e6d56b77 Allow async work in rpc RequestCallback processing. (#30637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30637

The RequestCallback api currently forces work to always be synchronous, which,
as we scale, means we're going to need to throw a large number of (mostly
blocked) threads at the rpc problem. For some activities, like dependent
autograd rpcs, there's no real need to block in these threads.

In this change, the RequestCallback api is updated to return a
shared_ptr<FutureMessage> rather than a Message:

   std::shared_ptr<FutureMessage> operator()(Message& request) const;

With a futures-style api, RPC ops that wish to be async can then be async,
while short-lived blocking functions (or Python UDFs) can just block.

In this change, we keep all of the current ops as synchronous (i.e. we block
and then return a completed FutureMessage). We also update the rpc_agents in
a manner compatible with this sort of parallelism.

Here, we only want to incur overhead when we use the async behavior.
Some modest extra cost seems unavoidable here (e.g. the allocation for the
std::make_shared<>), but we can trivially detect the synchronous/completed
case in the rpc_agent and avoid the extra thread-switches/etc. in that case.
ghstack-source-id: 95287026

Test Plan:
- Basic: buck test mode/dev-nosan caffe2/test/...
  - Additional testcase in ThriftRpcAgentTest for deferred work.

Differential Revision: D18774322

fbshipit-source-id: cf49922a71707cfb1726de16f93af23b160385d8
2019-12-10 16:11:05 -08:00
e42af97349 Add quantized concat conversion (#30887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30887

Support converting quantized concat from PyTorch to Caffe2.

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps.test_cat

Imported from OSS

Differential Revision: D18855676

fbshipit-source-id: 5d0cf3f03c61819e168b080afa368b1255d0419c
2019-12-10 15:46:16 -08:00
3de8584de8 Correct definition of nodes that work with Autograd (#30683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30683

Assume that a node can work with autograd only if it is not a fusion
group and is in the prim or aten namespace.

Test Plan: CI

Reviewed By: lly-zero-one

Differential Revision: D18795171

Pulled By: ilia-cher

fbshipit-source-id: 301090557e330b58be70e956784f7f0dc343c684
2019-12-10 15:39:38 -08:00
b7652a2f81 remove py2 flake8 lint (#29357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29357

As title

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D18920562

Pulled By: suo

fbshipit-source-id: b5dd559cfb0ba6c64b9ccf3655417afb56a7b472
2019-12-10 15:31:10 -08:00
d113b22571 kill PyTorch py2 circle jobs (#29353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29353

First step to killing Python 2 everywhere. I don't really know that much
about the caffe2 circle jobs so I left them alone for now.

Test Plan: Imported from OSS

Differential Revision: D18920563

Pulled By: suo

fbshipit-source-id: b37d8427a6ecd4b8a7e16c1ff948e0ce13b5798f
2019-12-10 15:31:06 -08:00
5edfe9cb80 add torch.square (#30719)
Summary:
fixes https://github.com/pytorch/pytorch/issues/30524
This adds a new operator, `torch.square`, to PyTorch.

I think it is ready for the first-time review now albanD
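Usage is as simple as:
```python
import torch

x = torch.tensor([-3., 2.])
torch.square(x)  # tensor([9., 4.]), equivalent to x * x
```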
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30719

Differential Revision: D18909268

Pulled By: albanD

fbshipit-source-id: 5626c445d8db20471a56fc1d7a3490e77812662b
2019-12-10 15:22:46 -08:00
e3d40f857b Make nn.Module forward() type annotation more permissive (#31057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31057

The current signature basically will always fail to type check, because
mypy enforces that the subclass method's input types must be "wider"
than their superclass method's input types (i.e. they can vary
contravariantly). And nothing is wider than `Any`.

This change makes it so that any input params are allowed in
`forward()`. Fixes #29099
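A sketch of the kind of override that previously failed type checking and is now accepted:
```python
import torch

class Scale(torch.nn.Module):
    # narrowing the inputs relative to a base-class `forward(*input: Any)`
    # used to be rejected by mypy's contravariance rule
    def forward(self, x: torch.Tensor, factor: float = 1.0) -> torch.Tensor:
        return x * factor
```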

Test Plan: Imported from OSS

Differential Revision: D18918034

Pulled By: suo

fbshipit-source-id: 9940e9f769b55d580d9d7f23abf6f88edb92627f
2019-12-10 14:36:13 -08:00
8fd85d70be Updating submodules
Summary:
GitHub commits:

163b6e2428
1d7a0e1a4b
b8031f09d7
7fd86a8f64

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 98b2487b39fb56641641c0947ed09f883755126a
2019-12-10 14:19:31 -08:00
ed20937231 Remove TensorImpl::maybe_zero_dim.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30878

Test Plan: Imported from OSS

Differential Revision: D18855989

Pulled By: gchanan

fbshipit-source-id: 44087b6136ec40d0a3de5b5a9f03c60d002a1107
2019-12-10 13:21:47 -08:00
0cbbe050bb Updating submodules
Summary:
GitHub commits:

b459fcc89f
2b060c1498
13a2c072c4

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 59fb11a977dcb7b2c09acb7fe997b0d5e52f27c4
2019-12-10 12:48:07 -08:00
cc319659e3 qnnpack TanH
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31013

Test Plan: Imported from OSS

Differential Revision: D18898903

Pulled By: z-a-f

fbshipit-source-id: aa126a98627b808678f629f39853c3b9c70eb2bf
2019-12-10 12:23:37 -08:00
b01b05790e Fix memory leak due to circular dependency. (#31030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31030

DistAutogradContext held a shared_ptr reference to RecvRpcBackward and
RecvRpcBackward held a shared_ptr reference to the context. This circular
dependency caused significant memory leaks. As a result, I'm changing the
reference in RecvRpcBackward to be a weak_ptr.

Test Plan: waitforbuildbot

Differential Revision: D18896389

fbshipit-source-id: e5bc588b6f998885854e3a67de1e82452e8475ce
2019-12-10 12:20:43 -08:00
57f29a44c7 Bug fix of the histogram observers (#30970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30970

Check null tensors in the histogram observers

Test Plan: f154576636 vs f154820243

Reviewed By: hx89

Differential Revision: D18865771

fbshipit-source-id: 669c014d914525deee36142e12f013afaf3caf1d
2019-12-10 11:45:20 -08:00
27d7dba9ab Remove scalar_check specification and codegen. (#30874)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30874

These have all been disabled at this point, so there is no difference in the generated code.

Test Plan: Imported from OSS

Differential Revision: D18855990

Pulled By: gchanan

fbshipit-source-id: 03796b2978e23ef9060063f33241a1cbb39f1cf3
2019-12-10 11:41:20 -08:00
47033b49f3 Suppress XCode build warnings (#31000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31000

## Summary

Add Fastlane configurations to suppress the build warnings from XCode.

Test Plan: Imported from OSS

Differential Revision: D18912489

Pulled By: xta0

fbshipit-source-id: f2c54d54a12ad2415695d1fcb1800684c7a9e560
2019-12-10 11:37:52 -08:00
2da3b9a0f6 Updating submodules
Summary:
GitHub commits:

fd8771904e
6bf51e234f
6380df5e10
696c2a2359

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 188670fcdc50ccf060eea137698ecfb45484e059
2019-12-10 11:23:13 -08:00
78a00d72b4 Revert D18899127: resubmit polish up overloads on free functions
Test Plan: revert-hammer

Differential Revision:
D18899127

Original commit changeset: 9049b8718926

fbshipit-source-id: c70a8aa4120aa757dce0926a8ab3cc5c92cd6041
2019-12-10 10:51:07 -08:00
394d2f7037 Fix the rendering of the doc of max. (#30779)
Summary:
Close https://github.com/pytorch/pytorch/issues/30731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30779

Differential Revision: D18837317

Pulled By: zou3519

fbshipit-source-id: b9b5ba414756a68d4b39a7a7c2d89fee1e3c040f
2019-12-10 10:48:16 -08:00
313c211f3f Calling JITed 8 Bit Fused SLS in FBGEMM from C2 (#30926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30926

Calling the JITed FBGEMM kernel for Fused 8 Bit Sparse Length Sum (Fused8BitRowwiseEmbeddingLookup)

Test Plan:
buck test  mode/dbg //caffe2/caffe2/python:lengths_reducer_fused_8bit_rowwise_ops_test

All tests pass.

Reviewed By: jspark1105

Differential Revision: D18058128

fbshipit-source-id: 0dfa936eb503712c39e53748e015fc156afde86f
2019-12-10 10:44:05 -08:00
bb7befb12c Support loading by blob in predictor
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30805

Reviewed By: ipiszy

Differential Revision: D18827383

fbshipit-source-id: b97f958768618ca29a02b057667a9b4ee313ad3c
2019-12-10 10:34:14 -08:00
a42d093db2 FCTransposed to FbFCPacked (#29766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29766

Add FbgemmPackTranspose op to support the packing on FCTransposed weights

Add FCTransposed to FbFCPacked transformation to Dper fp16 exporter

Test Plan:
```
buck test mode/opt caffe2/caffe2/fb/fbgemm:fb_fc_packed_op_test
```

```
buck test mode/opt caffe2/caffe2/python:layers_test
```

Differential Revision: D18482306

fbshipit-source-id: e8f1947b3d0d04892293509ebf88742f5f0f5997
2019-12-10 10:18:21 -08:00
c34ef1aa2e Automatic update of fbcode/onnx to c08a7b76cf7c1555ae37186f12be4d62b2c39b3b (#30619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30619

Previous import was fea8568cac61a482ed208748fdc0e1a8e47f62f5

Included changes:
- **[c08a7b76](https://github.com/onnx/onnx/commit/c08a7b76)**: doc: fix some typos at ONNXIFI (#2473) <Yorkie Liu>
- **[4be12d46](https://github.com/onnx/onnx/commit/4be12d46)**: remove workshop update since it is done (#2460) <Prasanth Pulavarthi>
- **[86107d1b](https://github.com/onnx/onnx/commit/86107d1b)**: Updated with correct URL to LICENSE (#2468) <Ryan Loney>
- **[9bf6fbb6](https://github.com/onnx/onnx/commit/9bf6fbb6)**: Update Argmin/Argmax (#2461) <Lara Haidar>
- **[748d81b8](https://github.com/onnx/onnx/commit/748d81b8)**: Fix windows conda build (#2452) <Ashwini Khade>
- **[a32db1c5](https://github.com/onnx/onnx/commit/a32db1c5)**: Delete duplicate word in comment (#2439) <Haibo Hao>
- **[e108da9a](https://github.com/onnx/onnx/commit/e108da9a)**: Fix bug in function body verifier (#2390) <G. Ramalingam>
- **[c3d3ef82](https://github.com/onnx/onnx/commit/c3d3ef82)**: docs: fix typo in IR.md (#2441) <Elliot Waite>

Test Plan: ci

Reviewed By: hl475

Differential Revision: D18766132

fbshipit-source-id: 13c04f21399579acb87a8f9fac2e4c329b0720b8
2019-12-10 10:15:08 -08:00
06c7420fa2 Raise error if a block can not be found from a CUDA tensor (#30870)
Summary:
After several discussions, we agreed not to put any extra safety check for recordStream as either the check will cause failures in certain scenarios or there is no need to throw for user errors.

In summary, it simply does what is described in https://github.com/pytorch/pytorch/issues/27405: check whether a tensor is indeed allocated by a CUDACachingAllocator instance; if it is, throw an internal error if a block cannot be retrieved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30870

Differential Revision: D18851669

Pulled By: yxia11

fbshipit-source-id: c2f01798cd24f1fd0f35db8764057d5d333dab95
2019-12-10 08:04:00 -08:00
af4040d808 resubmit polish up overloads on free functions (#31014)
Summary:
Resubmitting https://github.com/pytorch/pytorch/pull/30356

The second commit reintroduces the deleted function that caused the previous revert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31014

Differential Revision: D18899127

Pulled By: eellison

fbshipit-source-id: 9049b8718926c329d9cb46bb96eac6c278e9b866
2019-12-10 07:57:47 -08:00
e05ee4c421 Remove BUILD_NAMEDTENSOR macros (#30894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30894

This PR begins the process of removing BUILD_NAMEDTENSOR macros. There
will be followups.

Reasons for removing the macros:
- BUILD_NAMEDTENSOR is always on and has been on since pytorch 1.3.0.
- Since we don't test building without it, it is useless to keep around.
- Code becomes nicer to read without the macros

Reasons for not removing the macros:
- potential for feature flagging

Now, I argue against needing to feature flag. The main reason why we
might want to feature flag is if we need to disable the feature.
We'd need a fast switch to disable the feature if someone discovers
in the future that named tensors caused some regression in some existing workflows.

In https://github.com/pytorch/pytorch/pull/25798, I did a variety of
macro- and micro- benchmarks to determine the performance impact of named
tensors on regular tensors.

[The
microbenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-529014810)
were not very stable, and running the
microbenchmarks for more iterations doesn't actually help because the
noise is not distributed in a nice way. Instead of microbenchmarks I ran
a [profiler
(perf)](https://github.com/pytorch/pytorch/pull/25798#issuecomment-555707645)
to estimate how much overhead named tensors add to unnamed code. I
estimated the overhead to be less than 100ns for `add` and even smaller
for `mm`; there are ways to optimize even futher if we find this to be a
problem.

[Initial
macrobenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-530539104)
were also not very stable. I ran imagenet for some number of epochs. To
make them more stable, I got rid of the data loading (which seemed to
vary between runs). [In some benchmarks without data
loading](https://github.com/pytorch/pytorch/pull/25798#issuecomment-562214053),
we can see that the results are less noisy now. These results support
no noticeable regressions in speed.

Test Plan: - wait for CI

Differential Revision: D18858543

Pulled By: zou3519

fbshipit-source-id: 08bf3853a9f506c6b084808dc9ddd1e835f48c13
2019-12-10 07:54:05 -08:00
f48a8901c5 Add floor_divide function (#30493)
Summary:
Adds `torch.floor_divide`, following numpy's `floor_divide` api. I only implemented the out-of-place version; I can add the in-place version if requested.

Also fixes https://github.com/pytorch/pytorch/issues/27512
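A quick usage sketch (non-negative operands shown):
```python
import torch

torch.floor_divide(torch.tensor([7, 9]), torch.tensor(2))
# tensor([3, 4])
```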
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30493

Differential Revision: D18896211

Pulled By: eellison

fbshipit-source-id: ee401c96ab23a62fc114ed3bb9791b8ec150ecbd
2019-12-10 07:51:39 -08:00
44428d0ee2 Updating submodules
Summary:
GitHub commits:

6c87dc4d3c
5ec43afc1d
1e3cb8283f
3af1c72471
dc8e6e6e68
405e596d50
f40ae54a52
479a143912
e63b40cb4b
cb5f0670a6
470a664def
6e8f70b2d9
0fb026ca58
3595e0cf38
79b171ffa3
fb5322d98d
cd48fc606b

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 99bee659ea0fca0247d67d2dac12a821e1bd402d
2019-12-10 07:45:23 -08:00
42324cb6e8 Change interface from map of TensorShape to shapeInfoMap (#30802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30802

Change shape_hints from map<string, TensorShape> to ShapeInfoMap to capture dimType info from the model file.

Reviewed By: ipiszy

Differential Revision: D18821486

fbshipit-source-id: c5d9ed72e158d3698aba38900aeda00f776745b4
2019-12-10 00:35:11 -08:00
5205556782 Export custom ops (#29752)
Summary:
Update to the export API:
When calling this API, a dict containing the custom opsets (domain and version) used to export the model can be provided.
We allow registering one custom opset (domain, version) per ONNX opset. So, when exporting an operator from a custom domain, users need to pass this pair. The default custom opset version is 1.
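A hedged sketch of the call shape (the module, domain name, and version below are hypothetical):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1  # stand-in; a real use would invoke a custom-domain op

torch.onnx.export(
    M(), torch.zeros(2), "model.onnx",
    opset_version=11,
    custom_opsets={"my.custom.domain": 2},  # hypothetical domain -> version
)
```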
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29752

Reviewed By: hl475

Differential Revision: D18703662

Pulled By: houseroad

fbshipit-source-id: 84d22557d132b526169051193d730761798fce60
2019-12-09 18:48:50 -08:00
04b9324476 Factor out getInvokedMethod in InsertQuantDeQuantHelper (#30860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30860

att

Test Plan:
.

Imported from OSS

Differential Revision: D18849021

fbshipit-source-id: e5ff260f2f4e88075b0c6b32ccfd8272053ccc41
2019-12-09 16:10:58 -08:00
fa6661422f Disable flaky test_rref_context_debug_info
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30990

Test Plan: Imported from OSS

Differential Revision: D18893023

Pulled By: mrshenli

fbshipit-source-id: 80b36927f243fa53c4d64f7e7c51097290ffdeee
2019-12-09 15:55:51 -08:00
73dd8c005a Revert D18864774: polish up overloads on free functions
Test Plan: revert-hammer

Differential Revision:
D18864774

Original commit changeset: 6c566738bd6f

fbshipit-source-id: 669192605a3bc1a6ba06bbb5cae54f61637a45ae
2019-12-09 15:41:45 -08:00
446488960a polish up overloads on free functions (#30356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30356

This finishes up the `torch.jit.overload` api for free-functions.
- defaults now required on the implementation function itself
- fully follows [overload spec](https://mypy.readthedocs.io/en/latest/more_types.html#function-overloading) such that the following is supported

```
@overload
def mouse_event(x1: int, y1: int) -> ClickEvent: ...
def mouse_event(x1: int,
                y1: int,
                x2: Optional[int] = None,
                y2: Optional[int] = None): ...
```

Note: `jit.overload` isn't supported yet for UDTs, but is supported for modules. This PR doesn't make the same changes for modules; if reviewers think I should include them, I could do so in a follow-up PR or wait to land this. Since that's still an internal api I think it's fine, and the changes here would allow us to expose `torch.jit.overload` on free functions.

Test Plan: Imported from OSS

Differential Revision: D18864774

Pulled By: eellison

fbshipit-source-id: 6c566738bd6f0551a000a9ea8d56e403636b7856
2019-12-09 15:12:18 -08:00
a03581b927 add tests that schemas are valid (#30749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30749

Add a check that each schema is sane.

I removed the defaults from symbolic_script because they were in some cases wrong and don't actually do anything. At the point they're invoked the forward should already have matched all arguments.

Test Plan: Imported from OSS

Differential Revision: D18864775

Pulled By: eellison

fbshipit-source-id: 273d7e96d65b8a3d3de72e2d7bfcdf2417046c6b
2019-12-09 15:12:13 -08:00
e9ca13d7f5 Add glue code to collect debug info from all components
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30888

Test Plan: Imported from OSS

Differential Revision: D18857139

Pulled By: mrshenli

fbshipit-source-id: 5c1bfb83a21a4a57c4297bb94f14baa09520b791
2019-12-09 14:39:11 -08:00
8a57362000 Fix index out of bound error in Engine::ready_queue_size when called before start_threads
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30967

Test Plan: Imported from OSS

Differential Revision: D18887178

Pulled By: mrshenli

fbshipit-source-id: 67baeac9214a4749ce7e9b4d89862c93620b2d5e
2019-12-09 14:39:07 -08:00
a38c9b1ade Adding debugging metrics to process group agent
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30884

Test Plan: Imported from OSS

Differential Revision: D18857140

Pulled By: mrshenli

fbshipit-source-id: 4ec61d13778dd49467159d0db4b6dd51feaf282b
2019-12-09 14:39:03 -08:00
82268bf300 handle reassignment to inf and nan (#30877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30877

Previously, when the environment tried to reassign variables which had been assigned to "inf" or "nan", it would fail because they are not simple values. Constant prop exposed this; a test was failing internally because of it.
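A sketch of the pattern that used to fail (a variable first bound to `inf`, then reassigned):
```python
import torch

@torch.jit.script
def clamp_max(x: torch.Tensor) -> float:
    bound = float('inf')        # 'inf' is not a simple value
    if x.numel() > 0:
        bound = float(x.max())  # reassignment previously failed here
    return bound
```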

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D18861016

Pulled By: eellison

fbshipit-source-id: b9b72978a26a0b00b13bf8ea7685825551f5a541
2019-12-09 14:20:17 -08:00
3eefc06feb add constant prop for immutable types (#30544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30544

Run Constant Propagation upon compilation only on ops with non-aliasing inputs and outputs. This speeds up the first run of `torchvision.models.resnet18` by over 50% and speeds up compilation by about 25% (although the effects didn't seem additive with https://github.com/pytorch/pytorch/pull/30503, so I'm going to land this PR first and then see if caching still has a sizable impact).

Running constant prop only with non-aliasing types does a lot of graph cleanup by removing constant ifs and a bunch of other smaller ops. It also avoids all the jitter problems we had when we tried running full constant prop previously. Because it is idempotent it doesn't jitter, and it doesn't jitter graphs constructed from tracing because tracing doesn't emit any ops that only involve non-aliasing inputs.

Full constant prop isn't idempotent because what ops are run depends on the state of mutation in the alias db, which will often change upon successive iterations of constant propagation, and because it affects graphs constructed from tracing.

Edit: if we were okay with running constant propagation on graphs constructed from tracing (potentially making them hard to debug), an alternative would be to run constant propagation until the graph reaches a fixed point.

Test Plan: Imported from OSS

Differential Revision: D18833607

Pulled By: eellison

fbshipit-source-id: 92a0adb4882d67ed5a0db5c279f5e122aeeba54a
2019-12-09 14:20:12 -08:00
648bb501a1 rename shouldAnnotate api (#30543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30543

`shouldAnnotate` doesn't make a ton of sense as a public api

Test Plan: Imported from OSS

Differential Revision: D18833608

Pulled By: eellison

fbshipit-source-id: 460ee05d0fa91b1edc640c037be2a6ee8eaf50a6
2019-12-09 14:20:07 -08:00
45f0556ba0 Proper print for one element tuple (#30853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30853

Right now we print a one-element tuple as `(val)`, which is
interpreted as `val` when parsing. This PR changes it
to `(val,)` so that one-element tuples are recognized when parsing.
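The plain-Python analogue of the ambiguity:
```python
print((5))   # 5      -- the parentheses are only grouping
print((5,))  # (5,)   -- the trailing comma makes it a tuple
```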

Test Plan:
.

Imported from OSS

Differential Revision: D18846849

fbshipit-source-id: 42959b9190c2567ef021a861497077c550324b7c
2019-12-09 14:15:40 -08:00
5bf58274cc getQParams return a dictionary of qparams (#30859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30859

We can return a dictionary of quantization parameters to simplify the code
handling these things a bit

Test Plan:
.

Imported from OSS

Differential Revision: D18849023

fbshipit-source-id: 09e9860b2656a1affa8776016e16794529bcee3b
2019-12-09 13:42:21 -08:00
fb36f1c334 Updating submodules
Summary:
GitHub commits:

0f96b98cec
8090b337a4
e43d2c4424
70d1c268bf
fc6140865b
4caba2ed65

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 5b4edf4267942ab0cbd2980dc500227e3ce353e3
2019-12-09 13:02:10 -08:00
536481d9de Fix missing virtual destructor (#30927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30927

Classes that are used polymorphically (e.g. have virtual methods) must have a virtual destructor; otherwise, deleting an instance through a base-class pointer is undefined behavior.
ghstack-source-id: 95144736

Test Plan: waitforsandcastle

Differential Revision: D18870351

fbshipit-source-id: 333af4e95469fdd9103aa9ef17b40cbc4a343f82
2019-12-09 12:25:26 -08:00
528fa737ba Custom op autograd tests (#30519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30519

Re-enable them and write a few additional ones
ghstack-source-id: 95143051

Test Plan: unit tests

Differential Revision: D18729561

fbshipit-source-id: 8cefd8320913d72a450a3324bfd7c88faed072d7
2019-12-09 12:25:22 -08:00
daef363b15 Move Softshrink activation to Aten(CPU+CUDA) (#30229)
Summary:
VitalyFedyunin, this PR ports the Softshrink activation to ATen.
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.Softshrink()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.12 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.18 (ms).
CPU:
input size(128, 100) forward time is 0.19 (ms); backward avg time is 0.23 (ms).
input size(128, 10000) forward time is 17.23 (ms); backward avg time is 16.83 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).
CPU:
input size(128, 100) forward time is 0.08 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.32 (ms); backward avg time is 0.08 (ms).
```
`OMP_NUM_THREADS=1:`
```
Before:
input size(128, 100) forward time is 0.08 (ms); backward avg time is 0.10 (ms).
input size(128, 10000) forward time is 7.58 (ms); backward avg time is 7.91 (ms).
After:
input size(128, 100) forward time is 0.08 (ms); backward avg time is 0.02 (ms).
input size(128, 10000) forward time is 7.30 (ms); backward avg time is 1.02 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30229

Differential Revision: D18810054

Pulled By: VitalyFedyunin

fbshipit-source-id: e19074824396570db45ba488ae4f9fe1b07a5839
2019-12-09 12:19:46 -08:00
4f342a61c1 add the worker IDs outside of addSendRpcBackward to ensure they are (#30914)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30914

When tensors don't require grad, we don't call `addSendRpcBackward`, where we record known workerIDs to clean up the dist autograd context later. But since  https://github.com/pytorch/pytorch/pull/29781, we always include the autograd context ID in RPCs, even if tensors do not require grad. So, it could be possible that we don't release the contexts on some nodes.

This can contribute to OOMs since the contexts will not be cleaned up in this case, which can be checked by running the unit test without this patch. We can fix this issue by moving the `addKnownWorkerIds` call to the `getMessageWithAutograd` function.
ghstack-source-id: 95178561

Test Plan: Added a unit test: `test_context_cleanup_tensor_no_grad`

Differential Revision: D18869191

fbshipit-source-id: b80f66bfd0dd7d01960abe1691d3f44095bb1b2b
2019-12-09 11:38:34 -08:00
c75bc9067c MultiMarginCriterion: move scalar_check from codegen to code.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30827

Test Plan: Imported from OSS

Differential Revision: D18833658

Pulled By: gchanan

fbshipit-source-id: decd42789d92d4fbfeea9b470b3d7333e3862263
2019-12-09 07:48:58 -08:00
190dac13e3 Use universal references and perfect forwarding in Loops.h. (#30466)
Summary:
This simplifies the generated code a bit, saving about 40K off of libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30466

Differential Revision: D18836215

Pulled By: resistor

fbshipit-source-id: ad75c9e04783bb29cc06afd2022f73f9625dd52b
2019-12-08 23:31:10 -08:00
6848f9abb8 call fp16<->fp32 routines in fbgemm from Half2Float and Float2Half operators (#30715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30715

Changed the caffe2/caffe2/TARGETS file to define USE_FBGEMM for x86 when USE_SSE_ONLY is not defined.

Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- Float16

Reviewed By: jianyuh

Differential Revision: D18806067

fbshipit-source-id: 1b44b90a9f6dc3c27f81a46038c0f7542ed2bab3
2019-12-07 19:46:47 -08:00
776fdda753 Add debug info API for distributed autograd. (#30642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30642

Adding a couple of basic metrics for distributed autograd which would
help in determining whether a workflow is stuck.
ghstack-source-id: 95156189

Test Plan: waitforbuildbot

Differential Revision: D18776478

fbshipit-source-id: a0556ad6fe2b7c3cd0082ee2350c1c78cafaaec5
2019-12-07 13:56:51 -08:00
0b33080992 Updating submodules
Summary:
GitHub commits:

452ebf30a8
8e85afc8a1
39d204760c
5760376392

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: aa1ff805dbe1a1cbe5eb256ed2ba30af587a8707
2019-12-07 13:48:58 -08:00
4bb497b38e MultiheadAttention fixes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30666

Test Plan: Imported from OSS

Differential Revision: D18864094

Pulled By: pbelevich

fbshipit-source-id: f7a634b2c7f526282bf918d47b9cc82aa0c0af1d
2019-12-07 09:42:10 -08:00
8b6d7698d6 Updating submodules
Summary:
GitHub commits:

40ac0e57c1

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: ac74c10651a5a4ef67c93a38dc6673f0687e38ae
2019-12-07 02:43:38 -08:00
f1bd8cc286 Fix lint issues in dist_autograd_test.py (#30928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30928

ghstack-source-id: 95152373

Test Plan: waitforbuildbot

Differential Revision: D18872870

fbshipit-source-id: 2cd1ef228da4bd90c13e2f067a0c89b975fa3179
2019-12-07 01:44:37 -08:00
63f1b780ba Support exporting aten::copy_ and aten::index_put to ONNX opset 11 (#26941)
Summary:
- [x] Add more comments and refactor the logic of `ReshapeToAdvancedIndexingFormat`
- [x] Add more description here. Cases that are/aren't supported, and how they are supported.
- [x] Need to merge this PR https://github.com/pytorch/pytorch/issues/27186 to enable testing inplace operators.

We are now supporting exporting aten::copy_ and aten::index_put to ONNX.
Here's a breakdown of the different cases in PyTorch code.

```
# Case 1: Scalar Indices
x[0, 1, 2] = data

# Case 2: Slice Indices
x[1:3, :, ::2] = data

# Case 3: Ellipsis Indices
x[..., 0] = data

# Case 4: Tensor Indices
ind1 = torch.tensor([0, 2])
ind2 = torch.tensor([1, 1])
x[ind1, ind2] = data

# Case 5: Mixing all the above cases
ind1 = torch.tensor([0, 2])
ind2 = torch.tensor([1, 1])
x[1:3, ind1, ind2, ..., 3] = data
```

Limitations:

Tensor indices must be consecutive 1-d tensors.

```
# Supported
ind1 = torch.tensor([0, 2])
ind2 = torch.tensor([1, 1])
x[ind1, ind2] = data

# Not supported
ind1 = torch.tensor([0, 2])
ind2 = torch.tensor([1, 1])
ind3 = torch.tensor([[0], [1]])
x[ind1, :, ind2] = data
x[ind3] = data
```

Negative indices are not supported.
```
# Not supported
x[-1] = data
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26941

Differential Revision: D17951030

Pulled By: houseroad

fbshipit-source-id: 4357777072f53aa0bc4b297aa1ee53457a7f8dec
2019-12-06 22:48:46 -08:00
a26238da57 Enable using torch.autograd.profiler.record_function as decorator (#30861)
Summary:
```python
from torch.autograd.profiler import profile, record_function

@record_function('my_func')
def f(x, y):
    return x + y

with profile() as prof:
    f(1, 2)
print(prof.key_averages().table())
```

```
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
my_func                               85.42%           86.796us         87.27%           88.670us         88.670us         1
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 101.606us
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30861

Differential Revision: D18857993

Pulled By: bddppq

fbshipit-source-id: eb6b8e2a8d4f3a7f8e5b4cb3da1ee3320acb1ae7
2019-12-06 21:38:35 -08:00
5c56986738 Attach autograd edges only for tensors requiring grad. (#30904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30904

When we sent tensors over RPC, on the server side we would call
addRecvRpcBackward which would call `set_history` on all tensors. This was
incorrect and set the `requires_grad` flag on tensors that didn't actually need
grad.

To fix this, we only attach autograd edges to tensors that need grads.
ghstack-source-id: 95113672
ghstack-source-id: 95113999

Test Plan: waitforbuildbot

Differential Revision: D18828561

fbshipit-source-id: d8942b76e9e4c567f8f1821f125c00d275ea0f90
2019-12-06 18:05:57 -08:00
62b10721fb Actually make flake8 do something (#30892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30892

Fixes all outstanding lints and actually installs a properly configured
flake8

Test Plan: Imported from OSS

Differential Revision: D18862825

Pulled By: suo

fbshipit-source-id: 08e9083338a7309272e17bb803feaa42e348aa85
2019-12-06 17:50:50 -08:00
8d35b6cec7 embedding_bag make_bag_size optimization (#30701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30701

From James' PR https://github.com/pytorch/pytorch/pull/19715

embedding_bag microbenchmarks:
Baseline: P123020983
Refactor make_bag_size, no changing at::zeros to at::empty (this diff): P123021393
Inference benchmark on T6_SKL - _embedding_bag self time only:
bs=40, baseline: .302 ms/iter
bs=40, with diff: .244 ms/iter
bs=1 baseline: .148 ms/iter
bs=1 with diff: .124 ms/iter
The bigger gap comes from fb::embedding_bag_byte_rowwise_offsets; I'm looking into that one too.

Test Plan:
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./inference_benchmark_nolr_emb.par --pt-scripted-model=traced_model.pt --pt-inputs="batch_size_40/pt_inputs.pth" --iters=3000 --warmup-iters=100
buck run mode/opt //caffe2/benchmarks/operator_benchmark:benchmark_all_other_test -- --tag_filter all --iterations 3000 --operators embeddingbag

Reviewed By: yinghai, qizzzh

Differential Revision: D18800166

fbshipit-source-id: 820e6ece0b6ade72ee42409661f92c548f43a4cb
2019-12-06 16:17:16 -08:00
cd6167ff63 Upgrade bazel to 1.2.0. (#30885)
Summary:
Companion diff for https://github.com/pytorch/xla/pull/1464. Should land only after the pytorch/xla PR is in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30885

Differential Revision: D18866835

Pulled By: ailzhang

fbshipit-source-id: 51f4d2770f8ef873a659579ddd81a42957ffb885
2019-12-06 16:08:24 -08:00
7b97eaeba5 Add module level qpl logging. (#30906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30906

Add mobile module observer to measure performance of each method run.
ghstack-source-id: 95120194

Test Plan:
Run pytext model through BI cloaking flow on lite-interpreter and verify logs are sent:
1. buck install -r fb4a
2. Go to internal setting and find MobileConfig, search for android_bi_infra_cloaking_iab_models and set the following params:
a. sample_rate: 1.0
b. enabled: true
c. use_bytedoc_pytorch_model: true
d. use_bytedoc_caffe2_model: false
e. use_full_jit: false
3. Go back to the news feed and scroll down until you find an ad that directs you to an offsite webpage;
4. Click on the ad and wait for the offsite ad to load;
5. Click back to the news feed;
6. Go to scuba table: https://fburl.com/scuba/4fghwp0b and see all the operator runs have been logged:

{F223456981}

Reviewed By: ljk53

Differential Revision: D18702116

fbshipit-source-id: a9f07eee684e3022cef5ba3c5934f30f20192a85
2019-12-06 15:52:26 -08:00
118f1c633b refactor the way we are handling bailout counts
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30410

Differential Revision: D18733370

Pulled By: Krovatkin

fbshipit-source-id: 0ea9dc0f3dd1a47bcc09f1d54745460f9bd71886
2019-12-06 15:45:38 -08:00
c37de32b23 Enable len(dataloader) for iterable dataset (#23587)
Summary:
Copy-paste comment from code for reasoning:

```
            # NOTE [ IterableDataset and __len__ ]
            #
            # For `IterableDataset`, `__len__` could be inaccurate when one naively
            # does multi-processing data loading, since the samples will be duplicated.
            # However, no real use case should be actually using that behavior, so
            # it should count as a user error. We should generally trust user
            # code to do the proper thing (e.g., configure each replica differently
            # in `__iter__`), and give us the correct `__len__` if they choose to
            # implement it (this will still throw if the dataset does not implement
            # a `__len__`).
            #
            # To provide a further warning, we track if `__len__` was called on the
            # `DataLoader`, save the returned value in `self._len_called`, and warn
            # if the iterator ends up yielding more than this number of samples.
```
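
A hedged sketch of the resulting behavior (the dataset here is illustrative, not from the PR):

```python
# An IterableDataset that implements __len__, so len(loader) is defined;
# without __len__, calling len(loader) raises TypeError.
from torch.utils.data import IterableDataset, DataLoader

class RangeDataset(IterableDataset):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # in real multi-process use, each replica should be configured
        # differently here to avoid duplicated samples
        return iter(range(self.n))

    def __len__(self):
        return self.n

loader = DataLoader(RangeDataset(10), batch_size=2)
print(len(loader))  # defined because the dataset implements __len__
```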

Fixes https://github.com/pytorch/pytorch/issues/30184
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23587

Differential Revision: D18852625

Pulled By: ailzhang

fbshipit-source-id: aea8d4d70c7f21aaa69b35908a6f43026493d826
2019-12-06 15:38:05 -08:00
a77eafa1d8 Fix 'initialized after field' error (#30908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30908

Same as title.

Test Plan: Wait for CI to clear.

Reviewed By: bddppq, xw285cornell

Differential Revision: D18862837

fbshipit-source-id: bc34356b85774fc20ba46d321c8a2bb5d5c727f6
2019-12-06 15:04:18 -08:00
baccd26df7 update code analyzer script to handle splitted torch libraries (#30864)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30864

Change it to handle all archive files under the install folder.

Test Plan:
```
ANALYZE_TEST=1 CHECK_RESULT=1 tools/code_analyzer/build.sh
ANALYZE_TORCH=1 tools/code_analyzer/build.sh
```

Differential Revision: D18850317

Pulled By: ljk53

fbshipit-source-id: 7c57ae16c82b6ded53aa7df385f3b6074190fc04
2019-12-06 14:38:30 -08:00
223f46f5fa Fix flake8 warning (#30905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30905

-
ghstack-source-id: 95117983

Test Plan: -

Differential Revision: D18861981

fbshipit-source-id: b794a7fbe05af29471286c7f665cf3f86541eb5a
2019-12-06 14:19:35 -08:00
4fd20c0816 Kill hypothesis deadline testing (#30890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30890

We've received way too many complaints about this functionality making tests flaky, and it's not providing value to us anyway. Let's cut the shit and kill deadline testing
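
For reference, a sketch of one common way to disable Hypothesis deadlines globally (not necessarily the exact mechanism used in this change):

```python
# Disable Hypothesis deadline checking via a settings profile.
from hypothesis import settings

settings.register_profile("no_deadline", deadline=None)
settings.load_profile("no_deadline")
```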

Test Plan: Imported from OSS

Differential Revision: D18857597

Pulled By: jamesr66a

fbshipit-source-id: 67e3412795ef2fb7b7ee896169651084e434d2f6
2019-12-06 13:36:14 -08:00
26c51468c5 Fix examples in RRef API doc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30857

Test Plan: Imported from OSS

Differential Revision: D18847527

Pulled By: mrshenli

fbshipit-source-id: 7dc9d28277597f8fc3ef97fa9ac98a312e76e6fb
2019-12-06 13:14:11 -08:00
642469b706 Fix examples in API doc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30856

Test Plan: Imported from OSS

Differential Revision: D18847528

Pulled By: mrshenli

fbshipit-source-id: 57f666d9d4b634fb77b1b65debd2b07e2bebd57a
2019-12-06 13:14:06 -08:00
5e6c3fb23b Add more details to explain rpc_backend_options arg in init_rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30855

Test Plan: Imported from OSS

Differential Revision: D18847529

Pulled By: mrshenli

fbshipit-source-id: b4f0d5797f3b41cce155b7821d6bd34b268bd24e
2019-12-06 13:14:02 -08:00
6d06b925ba Remove values_to_quantize_ (#30858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30858

This is not needed since we have `values_to_qparams_`

Test Plan:
.

Imported from OSS

Differential Revision: D18848992

fbshipit-source-id: dc81f59967a93abdd5562f1010f02de4f4e60db0
2019-12-06 12:15:13 -08:00
81e4739141 Move QScheme ops to c10 (#30134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30134

ghstack-source-id: 95055387

Test Plan: buck build mode/dev caffe2:generate-code

Differential Revision: D18609716

fbshipit-source-id: fec39359e0b97387a9b13f8179d72a731cc61808
2019-12-06 12:04:51 -08:00
d6ddfab11f save linux build binary size to Scuba (#30832)
Summary:
example: https://fburl.com/scuba/mjheume7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30832

Differential Revision: D18857146

Pulled By: mingbowan

fbshipit-source-id: 66bcd352922944c227f337a66e8a75e2d7393fd3
2019-12-06 11:55:35 -08:00
78254eab45 Add mobile operator observer for qpl logging.
Summary: Add mobile operator observer to measure the performance of each operator run; the results will also be logged into the QPL event [MOBILE_OPERATOR_STATS](https://fburl.com/quicklog/8773a00a).

Test Plan:
Run pytext model through BI cloaking flow on lite-interpreter and verify logs are sent:
1. buck install -r fb4a
2. Go to internal setting and find MobileConfig, search for android_bi_infra_cloaking_iab_models and set the following params:
a. sample_rate: 1.0
b. enabled: true
c. use_bytedoc_pytorch_model: true
d. use_bytedoc_caffe2_model: false
e. use_full_jit: false
3. Go back to the news feed and scroll down until you find an ad that directs you to an offsite webpage;
4. Click on the ad and wait for the offsite ad to load;
5. Click back to the news feed;
6. Go to scuba table: https://fburl.com/scuba/er7t4g9u and see all the operator runs have been logged:

{F223250762}

Reviewed By: ljk53

Differential Revision: D18131224

fbshipit-source-id: 23e2f6e2a9851c04b29511b45dc53f3cce03e8a0
2019-12-06 11:55:32 -08:00
44ff7b08d8 Reduce intrusive_ptr incref/decref costs (#30709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30709

Intrusive_ptr doesn't provide an explicit incref method. When a user wants to
incref the target, they create an intrusive_ptr to wrap the target, then make
a copy which does the actual incref, then release both the first intrusive_ptr
and the copy to prevent a decref at destruction time. This is very
inefficient. Instead, do the incref/decref directly.

Differential Revision: D18798505

fbshipit-source-id: 524d4f30d07d733df09d54423b044d80e4651454
2019-12-06 11:52:20 -08:00
e123d90a93 Back out "Back out "Back out "Revert D18542342: Boxed variable dispatch""" (#30650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30650

Original commit changeset: 51bb7aac7cb7
ghstack-source-id: 95082205

Test Plan: CI

Differential Revision: D18778190

fbshipit-source-id: 7e9577e88fd0492006b6ea836ec081aea9da6b0c
2019-12-06 11:45:09 -08:00
37435d36ed Refactor VariableTypeManual (#30649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30649

Operators in VariableTypeManual are no longer registered against the VariableTypeId key; instead they are registered as compound ops. See https://github.com/pytorch/pytorch/issues/30102 for background.

This also requires the non-variable codegen to ignore them and requires removal of VariableMethodStubs.cpp.

So, because function_wrapper.py now also needs to know which ops are manual, instead of having a hard-coded list in gen_variable_type.cpp for ops with manual implementation, we now have a `manual_kernel_registration` flag in native_functions.yaml that disables the registration of operator kernels for this operator (the schema is still registered). Then, we manually register the right kernels for the operator.
ghstack-source-id: 95082204

Test Plan: unit tests

Differential Revision: D18778191

fbshipit-source-id: 0af6f9e43ff4fb9800ce19b286dfccd0fd22cc41
2019-12-06 11:45:05 -08:00
b0e7db5b31 Revert D18840736: make sure windows tests get triggered
Test Plan: revert-hammer

Differential Revision:
D18840736

Original commit changeset: 6fdf73649622

fbshipit-source-id: 719576e9c717847bfb4b057875a273123e941db3
2019-12-06 11:26:37 -08:00
4ed2eae2d0 Add registerQParams function (#30552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30552

For upcoming changes to support quantizing shared class type

Test Plan:
.

Imported from OSS

Differential Revision: D18818653

fbshipit-source-id: 393a55db69b20a1c00ffa0157ab568cb097915b2
2019-12-06 11:17:35 -08:00
0051467118 Update CITATION from Workshop paper to Conference paper (#30872)
Summary:
The conference paper is finally published at NeurIPS 2019: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30872

Differential Revision: D18854253

Pulled By: soumith

fbshipit-source-id: 4f91838b1953e976542997959d5571884f739872
2019-12-06 09:16:17 -08:00
377131b0eb MultiMarginCriterion: fix scalar_check in the case where reduction == None. (#30826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30826

Previously the scalar_check for the reduction None case was:
input.dim() <= 1, but it should be target based, i.e.:
target.dim() == 0.  This follows from the "correct cases", i.e.
(N, C) X (N,) -> (N,)
(C,) X () -> ()
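
A minimal sketch of the (C,) X () -> () case (values are illustrative):

```python
import torch

loss = torch.nn.MultiMarginLoss(reduction='none')
x = torch.randn(5)       # (C,)
y = torch.tensor(1)      # () -- 0-dimensional target
print(loss(x, y).shape)  # expected: torch.Size([])
```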

Test Plan: Imported from OSS

Differential Revision: D18833660

Pulled By: gchanan

fbshipit-source-id: 26338b842a8311718c4b89da3e2f1b726d5409b8
2019-12-06 09:04:38 -08:00
5687ee1d85 added a serialize function in SGD class to utilize the existing macro for serialization/deserialization calls
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30739

Differential Revision: D18842908

Pulled By: anjali411

fbshipit-source-id: 7dc13ff9c4fc126790b88b1b4b5d03425c349d38
2019-12-06 08:38:07 -08:00
e5d571ae25 Remove scalar_check from topk, move it to the THC implementation.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30852

Test Plan: Imported from OSS

Differential Revision: D18842662

Pulled By: gchanan

fbshipit-source-id: b5e8a4367fce9441be2ddbd026495f1911038221
2019-12-06 07:50:20 -08:00
60714dfb64 change index_select scalar_check to retain dimensionality of input. (#30790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30790

The index_select documentation reads:
"The returned tensor has the same number of dimensions as the original tensor (input)."

But the implementation would return a 0-dimensional tensor iff both the input and index were 0-dimensional.
This change makes it so we return a 0-dimensional tensor iff the input is 0-dimensional.
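
A minimal sketch of the corrected rule (assuming a 0-dimensional index is accepted here):

```python
import torch

x = torch.tensor(5.0)                # 0-dimensional input
idx = torch.tensor(0)                # 0-dimensional index
out = torch.index_select(x, 0, idx)
print(out.dim())                     # expected: 0, matching the input
```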

Restacked version of: https://github.com/pytorch/pytorch/pull/30502

Test Plan: Imported from OSS

Differential Revision: D18825717

Pulled By: gchanan

fbshipit-source-id: aeb10c5107e748af3e264fbdc81fff5dd4833cc4
2019-12-06 07:47:53 -08:00
1d7b40f1c4 Fix reading __cuda_array_interface__ without strides (#24947)
Summary:
When converting a contiguous CuPy ndarray to Tensor via `__cuda_array_interface__`, an error occurs due to incorrect handling of default strides. This PR fixes this problem. It makes `torch.tensor(cupy_ndarray)` works for contiguous inputs.
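
A sketch of the conversion in question (requires CuPy and a CUDA device):

```python
# Converting a contiguous CuPy array, whose __cuda_array_interface__
# may omit strides, into a torch.Tensor.
import cupy
import torch

a = cupy.arange(6).reshape(2, 3)  # contiguous; interface may omit strides
t = torch.tensor(a)               # previously raised on the default strides
print(t.shape, t.device)
```
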
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24947

Differential Revision: D18838986

Pulled By: ezyang

fbshipit-source-id: 2d827578f54ea22836037fe9ea8735b99f2efb42
2019-12-06 07:36:27 -08:00
11b3065323 Run method_tests on CUDA. (#30821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30821

While investigating why our tests didn't catch #30704, I noticed that none
of our tests in method_tests() were being run on CUDA.  This diff moves
those tests into the new device-generic test framework so that we also get
CUDA coverage.  For expediency, I blacklisted all tests which didn't work
on CUDA (rather than fix them); that's something we can leave for future PRs.
This is done by way of a new expectedFailure gadget.

Note that all occurrences of skipIfNoLapack needed to be replaced with
skipCPUIfNoLapack.

I punted for test_jit; it's possible those tests should also run CUDA but a JIT
expert should take a look here.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18840089

Pulled By: ezyang

fbshipit-source-id: 66b613b5024c91d3e391c456bb642be7e73d4785
2019-12-06 07:24:27 -08:00
9a858aba5f Moving checks related to options.aliasAnalysis and schema.hasAliasInfo to read callsite (#30671)
Summary:
**Context:**
In D18530964, we allow aliasAnalysis to be left unset in a previous registration call and then updated to the correct one in a following registration call.

But it's not working E2E due to those existing checks.

So we want to remove or delay those TORCH_CHECKs.

Here are the three existing callsites for operator.aliasAnalysisKind():
https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/caffe2/torch/csrc/jit/ir.cpp?lines=994%2C995%2C996%2C1001%2C1004

https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/caffe2/torch/csrc/jit/operator.cpp?lines=147%2C155

https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/caffe2/torch/csrc/jit/passes/alias_analysis.cpp?lines=260%2C277%2C380

**Things to check**
1. Those two checks are different. But since in the original op_registration code, if options.schemaOrName_->is_right() is FALSE, we convert it to a FunctionSchema type, at the read callsites we only need to check the following: options.aliasAnalysisKind_ == AliasAnalysisKind::FROM_SCHEMA || !schema.hasAnyAliasInfo()

2. If the three callsites above are indeed needed for those checks.

3. Here we assume that reads from JIT or other places always happen after all registration calls are done. We are trying to make sure this is a valid assumption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30671

Test Plan: Will update and refactor the tests soon.

Differential Revision: D18784623

Pulled By: charliechen0401

fbshipit-source-id: 75edea140d0ae3e54820e1aeef010c81fe26416a
2019-12-06 01:36:22 -08:00
619e2ffe23 Replace deprecated AT_* with TORCH_* to reduce warnings in c10d
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30795

Test Plan: Imported from OSS

Differential Revision: D18826310

Pulled By: mrshenli

fbshipit-source-id: 0041ac2e5788e874e0a566abd57a8a90e658da9b
2019-12-06 01:28:30 -08:00
b0cba8ceae Replace deprecated AT_ERROR with TORCH_CHECK to reduce warnings in rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30794

Test Plan: Imported from OSS

Differential Revision: D18826311

Pulled By: mrshenli

fbshipit-source-id: bfd58d30f386bbe9535264b2afce4acbe7ac5b0e
2019-12-06 01:28:26 -08:00
2011cc1e91 Fix half->float case of softmax backward when inner_size is not 1 (#30838)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30572

That unit test fails on master and succeeds with this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30838

Differential Revision: D18841066

Pulled By: ngimel

fbshipit-source-id: 86a7ccdb3016c98d62dd0946daff101704cd1f68
2019-12-06 00:25:34 -08:00
d32aec5ad6 Add get_metrics and get_debug_info to rpc agent (#30833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30833

[rpc] Add get_metrics and get_debug_info to rpc agent

Test Plan: UT and builds

Reviewed By: mrshenli

Differential Revision: D18835068

fbshipit-source-id: f552cf196bb6d54ccd38a44ba981e7d5b15513f0
2019-12-05 23:52:42 -08:00
58cdf1429c Add tests for quantizing traced models (#30476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30476

att

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D18795724

fbshipit-source-id: 9253e102bf458d9185f68848071a4e4eff9f9b08
2019-12-05 23:03:45 -08:00
f1755d9aea Insert GetAttr for quantization parameters instead of Constant (#30551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30551

To enable quantizing with shared types, we need to insert GetAttr nodes for
quantization parameters since the code might be shared by multiple module instances
and we'd like to make quantized module instance also share the same code but with
different values of attributes.

Test Plan:
test_jit.py, test_quantization.py

Imported from OSS

Differential Revision: D18818652

fbshipit-source-id: fc95623cac59dcedd9e3f95397524eae515e7a11
2019-12-05 22:52:45 -08:00
1fa4908ac0 Refactor test_quantization.py and enable test_nested (#30475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30475

att

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D18795727

fbshipit-source-id: c9942c5361e0a34e91a08b8fc27405799db7ff4f
2019-12-05 21:56:03 -08:00
ef95a72690 modify test_local_shutdown_with_rpc to not be flaky (#30837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30837

This test would get very occasional flakes, with an error saying the
RPC timed out. This happened because one worker would still be waiting for the
return value of an RPC, but another worker had already performed its local
shutdown, so it would not have sent the response. This didn't show up in
initial testing since the flakiness is very rare (< 1/100 test runs). This diff
fixes the issue by not erroring if these RPCs time out. This is okay
because with a local shutdown, we should not expect all outstanding RPCs
to complete, since workers are free to shut down without completing or
waiting on outstanding work.
ghstack-source-id: 95021672

Test Plan: Ran the test 1000 times to ensure that it is not flaky.

Differential Revision: D18775731

fbshipit-source-id: 21074e8b4b4bbab2be7b0a59e80cb31bb471ea46
2019-12-05 21:46:39 -08:00
7af9d77290 Update persons_of_interest.rst
Updating to add POI for mobile, quantization and an addition to optimizers.
2019-12-05 21:20:40 -08:00
a7406516d1 Refactor bias and weight check and add aten::linear pattern (#30474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30474

There are some common parts in `isBiasOfConvOrLinear` and `isWeightOfConvOrLinear` that we can factor
out; this refactoring will allow for easier extension to new patterns.

Test Plan:
python test/test_jit.py
python test/test_quantization.py

Imported from OSS

Differential Revision: D18795725

fbshipit-source-id: 446463da5e3fa8464db441ed0d9651930487b3b7
2019-12-05 21:00:39 -08:00
a51c5f5cbf Add JIT pass to insert permutes for conv ops (#30679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30679

Caffe2 expects quantized ops to be in NHWC format while PyTorch inputs are in NCHW.
This adds a JIT pass that inserts an nchw2nhwc permute before each conv op and an nhwc2nchw permute after it.
We use the graph rewriter to find consecutive redundant permutes and remove them from the graph; a sketch of the layout round-trip is below.
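
A hedged sketch of the round-trip (plain tensor ops, not the actual JIT pass):

```python
import torch

x = torch.randn(1, 3, 8, 8)        # NCHW, PyTorch's native layout
nhwc = x.permute(0, 2, 3, 1)       # nchw2nhwc, inserted before the conv
nchw = nhwc.permute(0, 3, 1, 2)    # nhwc2nchw, inserted after the conv
assert torch.equal(x, nchw)        # consecutive inverse permutes cancel out
```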

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps

Imported from OSS

Differential Revision: D18790518

fbshipit-source-id: 4dd39cf0b31b21f5586c0edfdce2260d4e245112
2019-12-05 18:51:16 -08:00
c1159494a6 Revert D18621773: we should have a config-based way to skip flaky tests
Test Plan: revert-hammer

Differential Revision:
D18621773

Original commit changeset: 5532f1d5fa3f

fbshipit-source-id: 22239b88a6f9551938e6e2178bf9162e3385b011
2019-12-05 17:08:20 -08:00
4034aa7621 make sure windows tests get triggered (#30836)
Summary:
we prefer "_" over "-" in build names, so change the checks in the test script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30836

Differential Revision: D18840736

Pulled By: mingbowan

fbshipit-source-id: 6fdf736496225c5f8ab44906d8f4681b7bf894a7
2019-12-05 15:47:56 -08:00
82c3f4861f Move hardtanh activation to Aten(CPU, CUDA) (#30152)
Summary:
VitalyFedyunin, this PR ports the Hardtanh activation to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.Hardtanh()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backward avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).
CPU
input size(128, 100) forward time is 0.02 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.84 (ms); backward avg time is 0.44 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).
CPU
input size(128, 100) forward time is 0.02 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.61 (ms); backward avg time is 0.10 (ms).
```
`OMP_NUM_THREADS=1:`
```
Before:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.07 (ms).
input size(128, 10000) forward time is 5.21 (ms); backward avg time is 5.25 (ms).
After:
input size(128, 100) forward time is 0.01 (ms); backward avg time is 0.02 (ms).
input size(128, 10000) forward time is 1.09 (ms); backward avg time is 1.09 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30152

Differential Revision: D18815545

Pulled By: VitalyFedyunin

fbshipit-source-id: d23b6b340a7276457f22dce826bcbe3b341d755f
2019-12-05 15:28:03 -08:00
6e38d50352 Revert D18117070: Migrate max and min (binary) from TH to ATen.
Test Plan: revert-hammer

Differential Revision:
D18117070

Original commit changeset: e06d37a8a140

fbshipit-source-id: 49dd33f52e7e3ffcaafc02109a0a0a67545ec7e8
2019-12-05 14:43:29 -08:00
e5bd7a7942 we should have a config-based way to skip flaky tests (#29944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29944

This particular approach queries our issue tracker for test titles that
match the following format:

```
DISABLED test_async_grad_guard_with_grad (jit.test_async.TestAsync)
```

And then skips the Python test for them. There is a 1 second timeout, so
if the internet flakes we still run the test suite without disabling any
tests.

This is intended as a quick fix, similar to ninja unland, to get to a green
master. Long term test disables should go into the code.
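
A hedged sketch of the title matching this implies (the regex and names are assumptions, not the exact implementation):

```python
# Parse an issue title of the form:
#   DISABLED test_name (some.module.TestSuite)
import re

title = "DISABLED test_async_grad_guard_with_grad (jit.test_async.TestAsync)"
m = re.match(r"DISABLED (\S+) \(([\w.]+)\)", title)
if m:
    test_name, test_suite = m.groups()
    print(test_name, test_suite)
```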

Test Plan: Imported from OSS

Differential Revision: D18621773

Pulled By: zdevito

fbshipit-source-id: 5532f1d5fa3f83f77fc3597126cbb7dba09a3c33
2019-12-05 14:28:27 -08:00
0974dcc244 Fix error checking of CUDA multi_margin_loss. (#30825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30825

It didn't verify in the 1-d case that the targets were size 1.

Test Plan: Imported from OSS

Differential Revision: D18833659

Pulled By: gchanan

fbshipit-source-id: 9b0276e7b0423fdaf2ba7cfa34bde541558c61f9
2019-12-05 14:23:00 -08:00
2ced81f289 Revert "Default to not build Caffe2 operators on Windows. (#29061)" (#30740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30740

This reverts commit 7102aceaf88ab71781c6019458bd7a07e86a532f.

Test Plan: Imported from OSS

Differential Revision: D18834315

Pulled By: ezyang

fbshipit-source-id: 2dbd1cf686864b9840365083182cd6188a285399
2019-12-05 14:01:59 -08:00
f874230d33 Vectorize smooth L1 loss backward function on CPU. (#30046)
Summary:
Benchmark (Intel i7-8850H, turbo off, release build, RHEL 7.7):

```
import timeit

for dtype in ('torch.float', 'torch.double'):
    print(f'dtype={dtype}')
    for n, t in [(10_000, 100000),
                (100_000, 20000)]:
        print(f'numel() == {n} for {t} times')
        print(timeit.timeit('output.backward(retain_graph=True)', number=t, setup=f"""
import torch
loss = torch.nn.SmoothL1Loss()
input = torch.randn({n}, requires_grad=True)
target = torch.randn({n})
output = loss(input, target)
"""))
```

Before:

```
dtype=torch.float
numel() == 10000 for 100000 times
6.154701935998673
numel() == 100000 for 20000 times
5.157296671999575
dtype=torch.double
numel() == 10000 for 100000 times
6.195317157000318
numel() == 100000 for 20000 times
5.099748799999361
```

After:

```
dtype=torch.float
numel() == 10000 for 100000 times
4.968745516000126
numel() == 100000 for 20000 times
2.4029395039997326
dtype=torch.double
numel() == 10000 for 100000 times
4.9910852479988534
numel() == 100000 for 20000 times
2.4867371629989066
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30046

Differential Revision: D18602399

Pulled By: VitalyFedyunin

fbshipit-source-id: 4c6c7b7b69ad6bce759786ddd7d6bc1e88ecf6ab
2019-12-05 13:57:42 -08:00
6486bdfb90 Fix os.register_at_fork not defined on Windows (#30809)
Summary:
According to https://docs.python.org/3.8/library/os.html#os.register_at_fork, this function is only available on Unix platforms.
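
A sketch of the portable guard (the handler below is a placeholder):

```python
# os.register_at_fork exists only on Unix, so guard the call.
import os

if hasattr(os, "register_at_fork"):
    os.register_at_fork(after_in_child=lambda: print("child process forked"))
```
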
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30809

Differential Revision: D18828777

Pulled By: bddppq

fbshipit-source-id: 3325a984da488bb0a80a5c27131553fbcf78921f
2019-12-05 13:36:53 -08:00
c564d794ed Add ATen/native/ headers to torch target (#30835)
Summary:
We didn't have ATen/native/*.h in the torch target before, and we would like them to be exposed for external use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30835

Differential Revision: D18836160

Pulled By: zrphercule

fbshipit-source-id: 7330a9c9d8b65f173cc332b1cfeeb18c7dca20a8
2019-12-05 13:24:21 -08:00
244b0bd1a5 Add docs for how we expose declarations in at:: to torch:: (#30760)
Summary:
This PR adds docs for how we expose declarations in `at::` to `torch::`, to make the semantics more clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30760

Differential Revision: D18833081

Pulled By: yf225

fbshipit-source-id: eff4d8815c67f681ce3a930ce99771cf2e55dbd9
2019-12-05 13:05:28 -08:00
be55874f2c style fixes to code analyzer (#30808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30808

Addressed some comments on #29550 after it's landed.

Test Plan:
```
LLVM_DIR=... ANALYZE_TEST=1 CHECK_RESULT=1 tools/code_analyzer/build.sh
LLVM_DIR=... ANALYZE_TORCH=1 tools/code_analyzer/build.sh -closure=false -debug_path=true
```

Differential Revision: D18835100

Pulled By: ljk53

fbshipit-source-id: 991d292ddc0211a88b04d0bdc24719f471c7786e
2019-12-05 11:25:37 -08:00
9617d07bd5 Wrap warning handler in a function to avoid siof (#30800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30800

The SparseNN benchmark crashed due to this.
Wrap the warning handler in a function to avoid SIOF (the static initialization order fiasco).

Test Plan: Tested locally, SparseNN benchmark no longer crashes.

Reviewed By: yinghai

Differential Revision: D18826731

fbshipit-source-id: 8fcab8a3f38cc20f775409c0686363af3c27d0a6
2019-12-05 11:22:15 -08:00
bf1b4b6fef add torch_cpu to the static library list in TorchConfig.cmake.in (#30769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30769

The TorchConfig.cmake is the public cmake we produce in install folder for
3rd party client code to get all libtorch dependencies easily.

Apparently this build flow is not well covered by our CI (which is focused
on 1st-party builds / shared libraries?), as the little dummy project for
code analysis testing purposes was broken by #30315 without failing any CI.

Fixed the problem for the mobile build and added the dummy project build to mobile
CI as well.

Test Plan: - make sure new CI pass;

Differential Revision: D18825054

Pulled By: ljk53

fbshipit-source-id: 80506f3875ffbc1a191154bb9e3621c621e08b12
2019-12-05 11:13:32 -08:00
f531815526 Deprecate tensor.type() (#30281)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29161.

I looked a bit at the code changes related to this and think I have all of the use cases of `DeprecatedTypeProperties` covered in the message, but suggestions from someone with more context on this would be very much appreciated :)
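
For illustration, a hedged sketch of the migration the deprecation points toward (the replacement calls are standard tensor properties):

```python
import torch

x = torch.zeros(3)
# deprecated: x.type()  ->  'torch.FloatTensor'
# preferred: query dtype and device directly
print(x.dtype, x.device)
# and to convert, prefer x.to(torch.float64) over x.type(torch.DoubleTensor)
```
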
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30281

Differential Revision: D18830818

Pulled By: ezyang

fbshipit-source-id: 1a7fcee15354ae09e6644577e7fa33bd26acfe20
2019-12-05 10:55:34 -08:00
2171f91053 reenable cuda_kernel_loop_overflow_large test (#30797)
Summary:
Fix https://github.com/pytorch/pytorch/issues/30771 has landed, original issue https://github.com/pytorch/pytorch/issues/26838 is now closed

cc peterjc123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30797

Differential Revision: D18827307

Pulled By: ngimel

fbshipit-source-id: 41b3db5fc9db85daeaa1b53c55b468976c996285
2019-12-05 10:09:39 -08:00
1578a28692 Migrate max and min (binary) from TH to ATen. (#27185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27185

TH implementation will be removed after the unary max and min are migrated.

Benchmark: (Debian 10, Release build, gcc 7.4, no turbo)

```python
import timeit
for device in ('cpu', 'cuda'):
    print(f'device: {device}')
    for op in ('max', 'min'):
        for dtype in ('torch.double', 'torch.float', 'torch.int16', 'torch.int32', 'torch.int64'):
            for n, t in [(10_000, 200000),
                        (100_000, 20000)]:
                print(f'torch.{op}(a, b), numel() == {n} for {t} times, dtype={dtype}')
                print(timeit.timeit(f'torch.{op}(a, b)' + (';torch.cuda.synchronize()' if device == 'cuda' else ''),
                                    setup=f'import torch; a = torch.arange({n}, dtype={dtype}, device="{device}"); b = torch.ones({n}, dtype={dtype}, device="{device}") * ({n} // 2)', number=t))
    print()
```

Before:

```
device: cpu
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.241763713000182
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.7138833169992722
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.2183356810000987
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.7031846980007685
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7704679510006827
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.289198366999699
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.7937613740014058
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2930124340000475
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8032857640009752
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.2908709189996443
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
1.8829010000008566
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.2994690759987861
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
1.8037853410005482
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.2929310759991495
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.8075240359994496
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2932477679987642
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.7868400779989315
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2885970789993735
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8389664830010588
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.29402057399966

device: cuda
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
4.787109836999662
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.842438002999188
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.429616614999759
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.835390076999829
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.940423873000327
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4108991760003846
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.9318018840003788
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4168134739993548
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9610764919998473
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4189234130008117
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.960172712999338
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.4162539499993727
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.8985912560001452
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.4113489299998037
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.9160250799995993
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4128787690005993
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.8806865219994506
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4086357010000938
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9362181240012433
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4151225870009512

```

After:

```
device: cpu
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
2.2685823729998447
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.72004808300062
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.212242640000113
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.7089235590001408
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7767087259999244
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2916517639996528
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.8265984959998605
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.3002885240002797
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8084679720004715
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.3012119999993956
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
1.8800218449996464
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.3060645710002063
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
2.4905043950002437
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.9126290209997023
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
1.7972335520007618
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.2918074379995232
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
1.8047651860006226
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.2992197730000044
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
1.8526509560006161
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.3030709570002728

device: cuda
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.double
4.700986622000528
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.8415469050005413
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.3051693249999516
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.float
1.8321999460004008
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.8086475109994353
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.405110773999695
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.913458047999484
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4236377289998927
torch.max(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
2.9386842409994642
torch.max(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4230227469997772
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.double
3.0341797270002644
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.double
1.4289592409995748
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.float
3.6091147850002017
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.float
2.036691903999781
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int16
2.8256167649997224
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int16
1.4078955400000268
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int32
2.8631781489993955
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int32
1.4210130069996012
torch.min(a, b), numel() == 10000 for 200000 times, dtype=torch.int64
3.0112479260005784
torch.min(a, b), numel() == 100000 for 20000 times, dtype=torch.int64
1.4297719679998409

```

Solve partly #24594 #24595

Close #25016

Test Plan: Imported from OSS

Differential Revision: D18117070

Pulled By: VitalyFedyunin

fbshipit-source-id: e06d37a8a1405848ba0b9e398870a77eb52bae8b
2019-12-05 09:55:56 -08:00
fa251cfd97 Fully deprecate variadic inputs of checkpoint_sequential (#25985)
Summary:
Support for variadic inputs of `checkpoint_sequential` was deprecated in https://github.com/pytorch/pytorch/issues/21006. This case was meant to warn with a `DeprecationWarning` in PyTorch 1.2, but to simply fail with a `TypeError` since PyTorch 1.3. This patch removes the `DeprecationWarning` from PyTorch 1.2.
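
A sketch of the supported calling convention after this change (a single input tensor, not variadic `*inputs`):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 10))
x = torch.randn(2, 10, requires_grad=True)
out = checkpoint_sequential(model, 2, x)  # passing *inputs now raises TypeError
```
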
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25985

Differential Revision: D18809875

Pulled By: albanD

fbshipit-source-id: e84dd8629c04979c4b2dc63e8ada94292e8cedd0
2019-12-05 09:23:28 -08:00
2607772959 Turn off scalar_checks for SpatialDepthwiseConvolution and SpatialConvolutionMM. (#30789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30789

The input(s) can't be 0-dimensional, so it's irrelevant.

Restacked version of: https://github.com/pytorch/pytorch/pull/30438

Test Plan: Imported from OSS

Differential Revision: D18825716

Pulled By: gchanan

fbshipit-source-id: a4883b795163efcb9d8dba6166d0f2102b6728a2
2019-12-05 08:07:31 -08:00
f12332eb51 Move scalar_check from codegen to code in MultiLabelMarginCriterion. (#30770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30770

Restacked version of: https://github.com/pytorch/pytorch/pull/30753

Test Plan: Imported from OSS

Differential Revision: D18821556

Pulled By: gchanan

fbshipit-source-id: 64b7311b1eb3855c4f1981d060accc918b99088d
2019-12-05 08:07:26 -08:00
50625798df Fix scalar check of MultiLabelMarginLoss. (#30768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30768

The behavior didn't match the documentation, because the documentation (for 'none' reduction) reads:
input X target -> output
(N, C) X (N, C) -> (N,)
(C,) X (C,) -> ()

but the latter case would output (1,). This also changes the case to:
() X (C,) -> ()
from:
() X (C,) -> (C,)
which makes more sense with the above formulas.
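
A sketch of the (C,) X (C,) -> () case (the target follows the documented index-list convention):

```python
import torch

loss = torch.nn.MultiLabelMarginLoss(reduction='none')
x = torch.randn(4)               # (C,)
y = torch.tensor([3, 0, -1, 1])  # (C,) -- labels 3 and 0, terminated by -1
print(loss(x, y).shape)          # expected: torch.Size([]) after this fix
```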

Restacked version of: https://github.com/pytorch/pytorch/pull/30748

Test Plan: Imported from OSS

Differential Revision: D18821554

Pulled By: gchanan

fbshipit-source-id: 3df77c51cf25648cb5fab62a68b09f49c91dab4e
2019-12-05 08:07:20 -08:00
473a044835 Fix a CUDA memory leak in MultiLabelMarginCriterion error checking. (#30767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30767

Restacked version of: https://github.com/pytorch/pytorch/pull/30733

Test Plan: Imported from OSS

Differential Revision: D18821553

Pulled By: gchanan

fbshipit-source-id: 8bf0365ce54dd2f07a5d6d0937332d0baf75b350
2019-12-05 08:07:15 -08:00
ba1a9871cb Turn off scalar_check for is_target for MultiLabelMarginCriterion, which is handled correctly in code. (#30766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30766

Restacked version of: https://github.com/pytorch/pytorch/pull/30728

Test Plan: Imported from OSS

Differential Revision: D18821555

Pulled By: gchanan

fbshipit-source-id: 27acc72f82e94eddeea675ae66e010cfb2fc7421
2019-12-05 08:07:10 -08:00
35a6997863 Support 0-d tensors in CUDA MultiLabelMarginCriterion. (#30765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30765

It is already supported in CPU and is pretty easy to add for consistency.

Restacked version of: https://github.com/pytorch/pytorch/pull/30727

Test Plan: Imported from OSS

Differential Revision: D18821557

Pulled By: gchanan

fbshipit-source-id: e6aa3e91000ff3fd63941defc7d30aef58ae2f82
2019-12-05 08:07:05 -08:00
c4e9748bc6 Provide full path for buck hipification (#30746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30746

This diff should be safe as long as open source build succeeds and should have no impact to cuda.

Differential Revision: D18811302

fbshipit-source-id: a7adab993816cba51842701898fac5019438b664
2019-12-05 07:57:52 -08:00
f2a2fec47c CUDA-strided-complex Binary and Unary Op support (#30295)
Summary:
In-tree changes to pytorch to support complex numbers are being submitted here.
Out-of-tree support for CUDA complex numbers is here: [pytorch-cuda-strided-complex extension](https://gitlab.com/pytorch-complex/pytorch-cuda-strided-complex)

Changes so far:

- [x]  Added complex support of torch.empty and torch.fill()
- [x]  Added complex support of CopyKernels
    - The 'static_cast_with_inter_type' template function is specialized for the following cases
        - `dest_t = thrust::complex<dest_value_t>`, `src_t = std::complex<src_value_t>`
        - `dest_t = std::complex<dest_value_t>`, `src_t = thrust::complex<src_value_t>`
     - This handles the compile-time case where `dest_value_t=double` and `src_value_t=float`.
- [x]  Added complex support of BinaryOp kernels
    - `using thrust_t = typename ztype_cuda<scalar_t>::thrust_t;` converts std::complex<T> ScalarTypes to thrust types and is a no-op for other ScalarTypes.
    - The operator is performed using complex number support defined in `thrust/complex.h`
    - This could be extended to work with ROCm by using `rocm/complex.h`
- [x]  Added complex support of UnaryOp kernels
    - Added CUDA support for `angle()`, `real()`, `imag()`, `conj()`
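
A hedged sketch exercising the ops listed above (requires a CUDA build that includes these changes; exact availability may vary by version):

```python
import torch

# empty + fill_ were among the first complex-enabled CUDA ops in this series
z = torch.empty(4, dtype=torch.complex64, device='cuda').fill_(1 + 2j)
print(torch.angle(z), torch.real(z), torch.imag(z), torch.conj(z))
```
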
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30295

Differential Revision: D18781954

Pulled By: ezyang

fbshipit-source-id: 25d204c0b8143ee27fda345a5d6a82f095da92a7
2019-12-05 07:30:39 -08:00
139aa51962 Clean up non-C++14 code (#28443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28443

We're now on C++14, so we don't need the else branch of these ifdef's anymore
ghstack-source-id: 94904074

Test Plan: waitforsandcastle

Differential Revision: D18069136

fbshipit-source-id: f1613cab9a99ee30f99775e4a60a1b06fd0a03ff
2019-12-05 00:41:29 -08:00
a939b52ddb fix AvgPool2d for 2^31-1 sized inputs, and get test_cuda_kernel_loop_… (#30771)
Summary:
…overflow_large to working state
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30771

Differential Revision: D18821529

Pulled By: ngimel

fbshipit-source-id: c5cbf56e686a2a3cfc7274dd96db37289dac7588
2019-12-04 20:58:30 -08:00
1d20c32bf1 Make InsertQuantDeQuantHelper global (#30550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30550

Right now we have an `InsertQuantDeQuantHelper` for each module, but we need
it to be global because we need to know which graphs have been quantized before;
based on this information we can decide how to handle each module instance.

Test Plan:
test_jit.py, test_quantization.py

Imported from OSS

Differential Revision: D18818651

fbshipit-source-id: bfcaf37094ce20a257171a0c99b05b9348ebc13d
2019-12-04 20:03:00 -08:00
c4c2e23385 Supporting making submodules unique (#30037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30037

Support quantization for modules with reused submodules, e.g. relu (automatically made unique).
We first do a pass on the graph to find all duplicate uses of the same module and record the `Value`s of the
module instance; for each of these values we create a new module and redirect the access to that module. A minimal example of the reuse pattern is sketched below.
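
```python
# An illustrative module (not from the PR): one relu submodule, two call sites.
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 3, 1)
        self.conv2 = nn.Conv2d(3, 3, 1)
        self.relu = nn.ReLU()  # shared across two call sites

    def forward(self, x):
        # both uses hit the same instance; the pass makes them unique
        return self.relu(self.conv2(self.relu(self.conv1(x))))
```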

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D18821483

fbshipit-source-id: 1698b981e9e9f0c728d9f03fcbcfbd260151f679
2019-12-04 19:26:56 -08:00
7a2889b014 Stop producing op_version_set version numbers.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28122

Test Plan: Imported from OSS

Differential Revision: D17959565

Pulled By: zdevito

fbshipit-source-id: 701101bd870700eb0c9882c69e2cfdd2524b555e
2019-12-04 19:14:43 -08:00
3c1bb21cf5 Invoke more passes in insertObservers (#30473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30473

Invoked the `ConstantPooling` and `FuseLinear` passes before
`insertObservers`.
`ConstantPooling` is for cleaning up the traced graph, e.g. when we
have two constant nodes with the same value, this pass will merge them;
this allows us to have fewer quantization patterns.
`FuseLinear` merges the exploded linear function back into `aten::linear` so
that we can quantize this function properly. We need to fuse it because right now
the way we recognize weight and bias is by matching the argument position in certain function
calls, e.g. the 1st argument of aten::conv2d is the weight. Therefore we have to preserve
the boundary of the linear function to recognize the weight of linear, since in the exploded
linear code the input of addmm is the transposed weight rather than the original weight of linear. A sketch of why the boundary matters is below.
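
For illustration (plain tensor ops; the actual pass operates on JIT IR):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3)
w = torch.randn(4, 3)  # the "original" linear weight
b = torch.randn(4)

exploded = torch.addmm(b, x, w.t())  # weight only appears transposed here
fused = F.linear(x, w, b)            # weight sits at a known argument position
assert torch.allclose(exploded, fused)
```
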
ghstack-source-id: 94887831

Test Plan:
This is needed for quantizing traced model tests to pass

Imported from OSS

Differential Revision: D18795722

fbshipit-source-id: 192d9d1e56307e2e1d90e30dce0502e31cb4f829
2019-12-04 18:45:04 -08:00
e09c415387 Back out "make the order btw div and mul in adagrad update consistent" (#30737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30737

Original commit changeset: 2a8b2a3f5401

Reverting this to be safe until we address test failures in T58528495

Test Plan: CI

Reviewed By: wx1988

Differential Revision: D18812384

fbshipit-source-id: 2a3ac554024773022ec827f259127e4c8cffe6e2
2019-12-04 17:43:45 -08:00
1f1ce53e8e Don't install pybind11 header directory for system pybind11 installs (#30758)
Summary:
For system pybind11 installs, this is a system header location that should not get installed, since it might include other unrelated headers. Since the headers are already present for a system install, there is no need to install them, so only do the install when we use the bundled pybind11 version.

Closes https://github.com/pytorch/pytorch/issues/29823. Closes https://github.com/pytorch/pytorch/issues/30627.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30758

Differential Revision: D18820189

Pulled By: bddppq

fbshipit-source-id: fcc9fa657897e18c07da090752c912e3be513b17
2019-12-04 16:43:21 -08:00
569ea63f3b fix anynonzero op
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29423

Test Plan: Imported from OSS

Differential Revision: D18820523

fbshipit-source-id: 55c7a1911121f0aed008bd684b448151bbbf0a8a
2019-12-04 16:40:43 -08:00
1d8a13147c Updating submodules
Summary:
GitHub commits:

1e345af4de
61d54df22c
dab87e19bf

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 88e55e94c7473a7a310338eaaf508e7fc71e0df6
2019-12-04 16:40:39 -08:00
cd032c7f6a Updating submodules
Summary:
GitHub commits:

b94ef9fb23
4462a7f00a
16e629c415
50770702ad
5b632a5deb
d2fa2cbcd6
4e152f651e
54c89b5f03

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 766783d00f8440c1264f13045ae6411233355af6
2019-12-04 14:56:01 -08:00
1707774417 AddConstant and findConstant for ClassType (#29217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29217

We want to preserve constant information in ClassType so that
users can access the constants in the module by name.
This is also used later for freezing some attributes (converting
attributes to constants).

Test Plan:
tbd

Imported from OSS

Differential Revision: D18799955

fbshipit-source-id: fbfbcd5d3f7f560368b96e2a87e270c822a3d03a
2019-12-04 14:17:13 -08:00
2308a0ec1b Improve documentation around builtin functions (#30347)
Summary:
This breaks the builtins page into some more sections and adds details about Python built-in functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30347

Pulled By: driazati

Reviewed By: wanchaol

Differential Revision: D18718166

fbshipit-source-id: bf43260ab7bcf92cccef684a5ce68cb16020771d
2019-12-04 13:50:40 -08:00
42e79d7e8a Kill THNN version of MultiMarginCriterion; it's not used anymore.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30725

Test Plan: Imported from OSS

Differential Revision: D18808767

Pulled By: gchanan

fbshipit-source-id: bcc4a6e272036f3d167fc158a53fe7aa1dec51f9
2019-12-04 13:46:32 -08:00
9d3402e4cb Add the __torch_function__ API override mechanism (#30730)
Summary:
This is a re-do of https://github.com/pytorch/pytorch/issues/27064, which was reverted (b8792c0438). The original landed at the same time as other work that added new operators to the `torch` namespace, so the check that the `torch` namespace is exhaustively covered for overridability was triggering test failures.

I've temporarily disabled that check and added an explanatory comment that the check will be re-enabled in a future PR that will be merged during a time when the commit velocity on PyTorch is lower.
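
A minimal sketch of an override under this mechanism, shown in the classmethod form the API eventually documented (details of the initial landing may differ):

```python
import torch

class Logged:
    def __init__(self, t):
        self.t = t

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {func.__name__}")
        # unwrap Logged arguments and delegate to the real implementation
        args = tuple(a.t if isinstance(a, Logged) else a for a in args)
        return func(*args, **kwargs)

torch.add(Logged(torch.ones(2)), torch.ones(2))  # prints "intercepted add"
```
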
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30730

Differential Revision: D18813270

Pulled By: ezyang

fbshipit-source-id: 70477c4656dca8fea6e7bc59259555041fcfbf68
2019-12-04 13:19:07 -08:00
289e9a07fd Move Tanh backward to Aten(CPU+CUDA) (#30224)
Summary:
VitalyFedyunin, this PR ports the Tanh backward function to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.Tanh()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    bwd_t = 0
    for i in range(10000):
        output = m(input)
        t1 = _time()
        output.backward(grad_output)
        t2 = _time()
        bwd_t = bwd_t + (t2 - t1)
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) backward avg time is %.2f (ms)." % (n, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) backward avg time is 0.12 (ms).
input size(128, 10000) backward avg time is 0.17 (ms).
CPU
input size(128, 100) backward avg time is 0.05 (ms).
input size(128, 10000) backward avg time is 0.35 (ms).
```
After:
```
GPU:
input size(128, 100) backward avg time is 0.12 (ms).
input size(128, 10000) backward avg time is 0.17 (ms).
CPU
input size(128, 100) backward avg time is 0.04 (ms).
input size(128, 10000) backward avg time is 0.25 (ms).
```
`OMP_NUM_THREADS=1:`
```
Before:
input size(128, 100) backward avg time is 0.03 (ms).
input size(128, 10000) backward avg time is 1.85 (ms).
After:
input size(128, 100) backward avg time is 0.02 (ms).
input size(128, 10000) backward avg time is 1.16 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30224

Differential Revision: D18810045

Pulled By: VitalyFedyunin

fbshipit-source-id: ab37948ab8f76bdaf9f3d1388562eaf29dacc0ea
2019-12-04 12:55:33 -08:00
d38f9117fd Cache compilation of free functions (#30503)
Summary:
We don't have to recompile free functions if we've already compiled them.

This improved the compilation time of resnet18 by 27%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30503

Differential Revision: D18796501

Pulled By: eellison

fbshipit-source-id: 2dee0fc5fcf9adc5b92213f8cb813730d71b376f
2019-12-04 12:45:35 -08:00
9d69c55b0d add MaskedRowWiseSparseAdagrad
Summary: As title

Test Plan: buck test caffe2/caffe2/fb/optimizers:masked_adagrad_test

Reviewed By: chocjy

Differential Revision: D18736639

fbshipit-source-id: d0d73f75228604d3448651bff2cf34ecc21f9ba6
2019-12-04 12:36:09 -08:00
786de33832 Move scalar_check logic from codegen to code in NLLLoss. (#30670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30670

Also turn off scalar_check for grad_input: it isn't necessary because the input can't be 0-dimensional.

Test Plan: Imported from OSS

Differential Revision: D18784523

Pulled By: gchanan

fbshipit-source-id: 246d30970457075a0403dd0089317659a2cd2dd4
2019-12-04 12:30:23 -08:00
fa2aa245cf Simplify scalar_check of nll_loss. (#30669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30669

The inputs can't be 0-d, so we don't need that check in the scalar_check.

Test Plan: Imported from OSS

Differential Revision: D18784524

Pulled By: gchanan

fbshipit-source-id: d44222dffc91880a6e8c7be69e6e146e60040d43
2019-12-04 12:30:19 -08:00
6918f0ce86 Move scalar_check for total_weight in NLLLoss functions to code from codegen. (#30665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30665

total_weight is a "hidden" output just for autograd, so it's not user visible.  The existing test_nn tests cover this (I verified that the new code is executed) and this matches the CPU behavior.

Test Plan: Imported from OSS

Differential Revision: D18782709

Pulled By: gchanan

fbshipit-source-id: 6d1c20eeaeffa14d06f375b37f11e866587f5fa0
2019-12-04 12:30:14 -08:00
756f279d95 Rename QuantizeHelper to InsertQuantDeQuantHelper (#30549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30549

Preparing for later refactoring

Test Plan:
.

Imported from OSS

Differential Revision: D18802464

fbshipit-source-id: 0b5afb143549d93eed4c429125d3d5fd253093a9
2019-12-04 10:40:22 -08:00
f73cd28082 InsertObservers for shared class types (#30548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30548

ClassTypes can be shared among different module instances, but previously we assumed
they would be unique. This PR enables the insert_observers pass to work with shared class types.

Test Plan:
python test/test_jit.py
python test/test_quantization.py

Imported from OSS

Differential Revision: D18802465

fbshipit-source-id: b782e71e44a043af45577ac2b5c83e695155bb8b
2019-12-04 09:34:47 -08:00
6e145b4614 add irregular c10 op registration/invocation cases to test project (#30558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30558

Most c10 op registration/invocation cases are generated by aten codegen
following some fixed pattern, but a handful of them were written
manually, mainly for quantized ops. Added these "irregular" cases to the
test project to verify that the static code analyzer can handle them as well.

Test:
- build and run the test project;

Test Plan: Imported from OSS

Differential Revision: D18811098

Pulled By: ljk53

fbshipit-source-id: 7bdf17175dfec41c56c0d70f124cc96478135bc4
2019-12-04 08:46:00 -08:00
a55f125e3b Check the error return of nvrtcGetProgramLogSize and nvrtcGetProgramLog (#30663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30663

Yes they can fail.  See https://github.com/ROCm-Developer-Tools/HIP/issues/1706

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18810088

Pulled By: ezyang

fbshipit-source-id: 96186e71c9a195bdbbed811e7ba8dc40bec09eae
2019-12-04 08:37:43 -08:00
ca072951d5 move MaskedAdagrad to caffe2/operators/experimental/optimizers (#30714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30714

Move Masked*Adagrad operators so caffe2/python/optimizer.py can use them.

Test Plan: buck test caffe2/caffe2/operators/experimental/optimizers:masked_adagrad_test

Reviewed By: chocjy

Differential Revision: D18805532

fbshipit-source-id: 49b1f755b31296c62e7a6a8134313b962ad9690c
2019-12-04 08:29:13 -08:00
d0af07ca4c Fix capitalization inconsistency in optim.rst
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30608

Differential Revision: D18808516

Pulled By: ezyang

fbshipit-source-id: 4be68be9a8c8c3da7a0b98162bc1050b588fab43
2019-12-04 08:17:03 -08:00
38986e1dea Split libtorch.so back into libtorch_{cpu,cuda,hip} (#30315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30315

The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.
This is a reland of https://github.com/pytorch/pytorch/pull/29731 but
I've extracted all of the prep work into separate PRs which can be
landed before this one.

Some things of note:

* torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
* The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
* In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
* A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly
* I had to make torch_cpu/torch_cuda caffe2_interface_library so that they get whole-archive linked into torch when you statically link. And I had to do this in an *exported* fashion because torch needs to depend on torch_cpu_library. In the end I exported everything and removed the redefinition in the Caffe2Config.cmake. I am not too sure why the old code did it this way in the first place; however, it doesn't seem to have broken anything to switch it this way.
* There are some uses of `__HIP_PLATFORM_HCC__` still in `torch_cpu` code, so I had to apply it to that library too (UGH). This manifests as a failure when trying to run the CUDA fuser. This doesn't really matter substantively right now because we still HIPify in place, but it would be good to fix eventually. This was a bit difficult to debug because of an unrelated HIP bug, see https://github.com/ROCm-Developer-Tools/HIP/issues/1706

Fixes #27215 (as our libraries are smaller), and executes on
part of the plan in #29235.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18790941

Pulled By: ezyang

fbshipit-source-id: 01296f6089d3de5e8365251b490c51e694f2d6c7
2019-12-04 08:04:57 -08:00
1189595875 Fix Tensor.argsort -> torch.argsort documentation link
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30464

Differential Revision: D18717657

Pulled By: zou3519

fbshipit-source-id: 9894f63c6cb1b5311117441e78805230d1bc09f3
2019-12-04 07:49:38 -08:00
b8792c0438 Revert D18645954: add __torch_function__ API override mechanism
Test Plan: revert-hammer

Differential Revision:
D18645954

Original commit changeset: 54b5e4344d7a

fbshipit-source-id: 4a7aebb483e6b001130d6f384ccc53c5a808ab13
2019-12-04 07:41:47 -08:00
a68b790293 fix ref to nonexistent torch.repeat
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30614

Differential Revision: D18808517

Pulled By: ezyang

fbshipit-source-id: 27f9bda6fbbd1c3c751a0e96fdc336bf724c0b31
2019-12-04 07:27:01 -08:00
ec7bb9de1c format tri[lu]_indices doc better
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30377

Differential Revision: D18689152

Pulled By: zou3519

fbshipit-source-id: 7fab1e39ecd39ef6a3869befcbe217f8d3b6a87e
2019-12-04 07:16:34 -08:00
d6ca93b353 add doc for F.softplus
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30055

Differential Revision: D18762624

Pulled By: zou3519

fbshipit-source-id: 61da88cbb8cd0f37ac26b0fb8aaacdbe85c724ba
2019-12-04 07:16:30 -08:00
d12786b24f add __torch_function__ API override mechanism (#27064)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24015 (see description of that issue for more details).

For a toy example, see the `DiagonalTensor` and `SubDiagonalTensor` class in test/test_overrides.py.

This PR currently contains:

* tests for `__torch_function__` behavior
* modifications to `gen_python_functions` and `parse` so that function signatures are parsed and calls are dispatched to the correct overloaded argument.

This feature is inspired by and analogous to NumPy's `__array_function__` protocol ([see NumPy Enhancement Proposal 18](https://numpy.org/neps/nep-0018-array-function-protocol.html#trying-array-function-methods-until-the-right-one-works)).
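
As a rough illustration, here is a minimal sketch of the protocol in the classmethod form it eventually stabilized into (the class and the `torch.mean` handling are illustrative toys modeled on the test example above; the exact signature in this original PR differed slightly):

```
import torch

class ScalarTensor:
    """Toy type standing in for `value * torch.eye(n)` without materializing it."""
    def __init__(self, n, value):
        self._n = n
        self._value = value

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        if func is torch.mean:
            # mean of value * eye(n) is (value * n) / n**2 == value / n
            return args[0]._value / args[0]._n
        return NotImplemented

d = ScalarTensor(5, 2)
print(torch.mean(d))  # 0.4, dispatched through ScalarTensor.__torch_function__
```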

### Benchmarks:
See Nathan's comment below: https://github.com/pytorch/pytorch/pull/27064#issuecomment-554601189
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27064

Differential Revision: D18645954

Pulled By: ezyang

fbshipit-source-id: 54b5e4344d7afdbcf996bb57191b0bdadc7b1767
2019-12-04 05:56:46 -08:00
c0299d2707 add LLVM code analyzer in order to replace static dispatch
Summary:
[Why static dispatch]
Static dispatch was introduced to allow stripping out unused ops at link
time (with “gc-sections” linker flag) for mobile build.

The alternative approaches to do "non-static" dispatch are:
* virtual methods - old ATen dispatcher, which has already been deprecated;
* registry pattern - used by caffe2, c10 and JIT;

However, none of them are “gc-sections” friendly. Global registrations are
root symbols - the linker cannot strip out any op if we use the registry pattern
for mobile.

[Why static dispatch isn’t great]
* One more code path to maintain;
* Need to recompile the framework to add new backends/ops;
* Doesn't support autograd yet, thus blocking on-device training;

[Static Code Analysis]
This PR introduces an LLVM analysis pass. It takes LLVM bitcode /
assembly as input and generates a dependency graph among aten ops. From a
set of root ops used by a model, we can calculate the transitive closure of
all dependent ops; then we can ask codegen to only register these ops.
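
As a rough sketch of that closure step (names here are illustrative, not the analyzer's actual code):

```
def transitive_closure(dep_graph, root_ops):
    """dep_graph maps each op/function name to the names it references."""
    needed = set(root_ops)
    stack = list(root_ops)
    while stack:
        node = stack.pop()
        for dep in dep_graph.get(node, ()):
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)  # keep walking until no new deps appear
    return needed
```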

[Approach]
To generate the dependency graph it searches for 3 types of connections in
LLVM bitcode / assembly:
 1) op registration: op name (schema string literal) -> registered function;
 2) regular function call: function -> function;
 3) op invocation: function -> op name (schema string literal)

For 2) it uses similar algorithm as llvm::LazyCallGraph - not only looks into
call/invoke instructions but also recursively searches for function pointers
in each instruction's operands.

For 1) and 3) it searches for connections between operator name string
literals / function pointers and c10 op registration/invocation API calls in
LLVM IR graph via "use" edges (bi-directional):
 1. llvm::Value has "users()" method to get other llvm::Value nodes that use
    the value;
 2. most of types derive from llvm::User which has "operands()" method to get
    other llvm::Value nodes being used by the value;

[Limitation]
For now the search doesn't go beyond the function boundary because the
references to op name string literals and c10 op registration/invocation
APIs are almost always in the same function.

The script uses regular expression to identify c10 API calls:
* op_schema_pattern="^(aten|quantized|profiler|_test)::[^ ]+"
* op_register_pattern="c10::RegisterOperators::(op|checkSchemaAndRegisterOp_)"
* op_invoke_pattern="c10::Dispatcher::findSchema|callOp"

If we create helper function around c10 API (e.g. the "callOp" method
defined in aten/native), we could simply add them to the regular expression
used to identify c10 API.

[Example]
In the following example, it finds out:
 1) the registered function for "quantized:add" operator;
 2) one possible call path to at::empty() function;
 3) the called operator name "aten::empty":

- "quantized::add"
- c10::detail::wrap_kernel_functor_unboxed_<at::native::(anonymous namespace)::QAdd<false>, at::Tensor (at::Tensor, at::Tensor, double, long)>::call(c10::OperatorKernel*, at::Tensor, at::Tensor, double, long)
- at::native::(anonymous namespace)::QAdd<false>::operator()(at::Tensor, at::Tensor, double, long)
- void at::native::DispatchStub<void (*)(at::Tensor&, at::Tensor const&, at::Tensor const&), at::native::qadd_stub>::operator()<at::Tensor&, at::Tensor const&, at::Tensor const&>(c10::DeviceType, at::Tensor&, at::Tensor const&, at::Tensor const&)
- at::native::DispatchStub<void (*)(at::Tensor&, at::Tensor const&, at::Tensor const&), at::native::qadd_stub>::choose_cpu_impl()
- void at::native::(anonymous namespace)::qadd_kernel<false>(at::Tensor&, at::Tensor const&, at::Tensor const&)
- at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
- at::TensorIterator::build()
- at::TensorIterator::fast_set_up()
- at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>)
- "aten::empty"

[How do we know it’s correct?]
* Built a test project that contains different op registration/invocation
  patterns found in pytorch codebase, including both codegen and non-codegen
  cases.
* Tried different optimization flags “-O0”, “-O3” - the result seems to
  be stable.
* Filtered by common patterns: “aten::”, “at::”, “at::native”,
  “at::CPUType”, “at::TypeDefault” - manually checked the relationship
  between function schema strings and corresponding implementations were
  captured.
* It can print instruction level data flow and show warning message if it
  encounters unexpected cases (e.g.: found 0 or multiple op names per
  registration/invocation API call, found 0 registered functions, etc).
* Verified consistent results on different Linux / macOS hosts. It can
  handle different STL library ABIs reliably, including rare corner cases
  for short string literals.

[Known issues]
* Doesn’t handle C code yet;
* Doesn’t handle overload name yet (all variants are collapsed into the
  main op name);

Test Plan:
```
LLVM_DIR=... ANALYZE_TEST=1 CHECK_RESULT=1 scripts/build_code_analyzer.sh
```

Differential Revision: D18428118

Pulled By: ljk53

fbshipit-source-id: d505363fa0cbbcdae87492c1f2c29464f6df2fed
2019-12-04 01:02:33 -08:00
f5c9452beb Fix toObject() r-value version (#30713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30713

It should use moveToIntrusivePtr.
This function is very hot and is used a lot in the interpreter loop, e.g.
GET_ATTR, SET_ATTR. Making a copy and doing an incref/decref caused big overhead.

Reviewed By: yinghai

Differential Revision: D18805212

fbshipit-source-id: 3a9368604f71638a21300ad086739c4b50f0644e
2019-12-04 00:19:35 -08:00
d456a538f9 op dependency analysis bash driver
Summary:
Move the shell script into this separate PR to make the original PR
smaller and less scary.

Test Plan:
- With stacked PRs:
1. analyze test project and compare with expected results:
```
ANALYZE_TEST=1 CHECK_RESULT=1 tools/code_analyzer/build.sh
```

2. analyze LibTorch:
```
ANALYZE_TORCH=1 tools/code_analyzer/build.sh
```

Differential Revision: D18474749

Pulled By: ljk53

fbshipit-source-id: 55c5cae3636cf2b1c4928fd2dc615d01f287076a
2019-12-04 00:12:24 -08:00
7e472679ff pin actions/checkout version
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30703

Test Plan: Imported from OSS

Differential Revision: D18805447

Pulled By: suo

fbshipit-source-id: d58ebe0e90b81c9282d3977f36c53c54cac750d9
2019-12-03 20:52:54 -08:00
b26401f965 Dump operator names of a script module (#30467)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30467

Introduce the function jit.export_opnames(module), which returns a list of all operator names used in the module and its submodules. One use case is a mobile custom build that links only the operators in the returned list, to reduce the mobile binary size.

Example:
```
import torch
m = torch.jit.load("example.pt")
print(torch.jit.export_opnames(m))
```

The outputs are in alphabetical order:
```
['aten::_convolution', 'aten::add.Tensor', 'aten::add_.Tensor', 'aten::addmm', 'aten::append.Tensor', 'aten::cat', 'aten::dropout', 'aten::embedding', 'aten::matmul', 'aten::max.dim', 'aten::mul.Tensor', 'aten::permute', 'aten::relu', 'aten::t', 'aten::tanh', 'prim::ListConstruct', 'prim::TupleConstruct', 'prim::TupleUnpack']
```

Test Plan: Imported from OSS

Differential Revision: D18801619

Pulled By: iseeyuan

fbshipit-source-id: f9b198d3e82b095daf704ee595d8026ad889bb13
2019-12-03 20:20:33 -08:00
63a1542ed2 Adding Debug Info for RRef Context
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30610

Test Plan: Imported from OSS

Differential Revision: D18763592

Pulled By: mrshenli

fbshipit-source-id: ad8854bdb6250c29eaa0f582d66cfd31394312e5
2019-12-03 19:16:31 -08:00
6dda241ab8 Add RRef.__str__() API
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30609

Test Plan: Imported from OSS

Differential Revision: D18763593

Pulled By: mrshenli

fbshipit-source-id: 20f1eea2d6cfe9ab2a27a9677d97dde07c1dca9b
2019-12-03 19:16:26 -08:00
bb5dcaf24f Add logical_and and logical_or (#30521)
Summary:
With the CI failure caused by 8bbafa0b32d2899ef6101172d62c6049427c977b fixed (the lambdas in the CUDA kernels had an incorrect return type)
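
For reference, a quick illustration of the two new ops:

```
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, False])
print(torch.logical_and(a, b))  # tensor([ True, False, False])
print(torch.logical_or(a, b))   # tensor([ True,  True,  True])
```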
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30521

Differential Revision: D18770151

Pulled By: ailzhang

fbshipit-source-id: 02f0fe1d5718c34d24da6dbb5884ee8b247ce39a
2019-12-03 18:24:54 -08:00
ab834d5093 Remove exp10 in TH (unused)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30422

Test Plan: Imported from OSS

Differential Revision: D18764280

Pulled By: VitalyFedyunin

fbshipit-source-id: 626b88a115f2efce4a53c6784f0a6660b36c97f9
2019-12-03 18:17:24 -08:00
76acf5b553 Remove many unused bfloat16 functions in TH
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30329

Test Plan: Imported from OSS

Differential Revision: D18764281

Pulled By: VitalyFedyunin

fbshipit-source-id: bc3f91c6d09d4f73c77fe1492a358128744aee76
2019-12-03 18:17:19 -08:00
4ac614191a Remove exp10 in TH (unused)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30422

Test Plan: Imported from OSS

Differential Revision: D18764186

Pulled By: VitalyFedyunin

fbshipit-source-id: 9343a5a7e4edf61ba3b85eaf846b2e149ed6529a
2019-12-03 18:17:15 -08:00
ea3697db69 inline to prevent duplicate obj when linking (#30363)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30363

getting duplicate definition errors when linking test.
ghstack-source-id: 94472892

Test Plan: CI passes

Differential Revision: D18669686

fbshipit-source-id: 3d3bfc38e4247cf8bea655537824b891b84f67bc
2019-12-03 15:59:25 -08:00
3cf8382984 detect_anomaly() for SparseTensors (#29803)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/28649

1. Modified detect_anomaly() to use isnan()
2. isnan() for SparseTensors returns a bool Tensor of _values.
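
A quick dense-tensor illustration of what anomaly mode catches (sparse tensors now flow through the same isnan() check):

```
import torch

with torch.autograd.detect_anomaly():
    x = torch.tensor(-1.0, requires_grad=True)
    y = x.sqrt()   # nan in the forward pass
    y.backward()   # anomaly mode raises, pointing at the backward of sqrt
```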
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29803

Differential Revision: D18594299

Pulled By: ezyang

fbshipit-source-id: 3f4190c569f53219be330584fc604ca43c4a6c7a
2019-12-03 15:42:51 -08:00
fef4360536 remove default constructor in futureInfo (#30197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30197

This default constructor was added because std::map's operator[]
requires a default constructor. However, instead of using operator[], we can
use emplace and remove the constructor, to ensure that the FutureInfo struct
doesn't get constructed with garbage values.
ghstack-source-id: 94802453

Test Plan: Unit tests pass.

Differential Revision: D18627675

fbshipit-source-id: c4cb000e60081478c0fd7308e17103ebbc4dc554
2019-12-03 15:36:22 -08:00
59151d3e43 autograd/profiler: support merging FunctionEventAvg (#30677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30677

Currently you can only add FunctionEvents to FunctionEventAvg. This makes it so you can add multiple FunctionEventAvg objects together. This is useful for merging multiple profiles together such as when dealing with distributed training.

Test Plan:
added unit test

  buck test //caffe2/test:autograd -- test_profiler

Reviewed By: bddppq

Differential Revision: D18785578

fbshipit-source-id: 567a441dec885db7b0bd8f6e0ac9a60b18092278
2019-12-03 15:28:58 -08:00
dcd1216efe Force early initialization of OpenMP in forked children (#29006)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/28389

Intel's OpenMP implementation sets the thread affinity on the first call to an OpenMP function after a fork. By adding an atfork handler we can force this to happen before a user tries to set the affinity in their own DataLoader `worker_init_fn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29006

Differential Revision: D18782456

Pulled By: ezyang

fbshipit-source-id: ce0b515256da0cf18ceb125e0cdec99a3311bbd3
2019-12-03 15:23:31 -08:00
a376dd344c Added check for torch.where on CPU that both arguments have same dtype (#30662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30662

Cherry picked from: https://github.com/pytorch/pytorch/pull/29081

Test Plan: Imported from OSS

Differential Revision: D18782295

Pulled By: nairbv

fbshipit-source-id: 897ab25ddf8819ca34f5e86c5d3f41debb56cb04

Co-authored-by: ifedan
2019-12-03 15:19:52 -08:00
56dd2836ec Make zeros argument of torch.where same dtype as other argument (#30661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30661

Cherry-picked from https://github.com/pytorch/pytorch/pull/29080

Test Plan: Imported from OSS

Differential Revision: D18781870

Pulled By: nairbv

fbshipit-source-id: 9de85aa91bf7e0856f35c7c6238a8923315ed27f

Co-authored-by: ifedan
2019-12-03 15:19:48 -08:00
2ba03e0287 Enable test_trainer_ps in dist_autograd_test.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30341

Test Plan: Imported from OSS

Differential Revision: D18769574

Pulled By: mrshenli

fbshipit-source-id: caf25742fa1fc9dbf6486f5ec981fae3f29784bc
2019-12-03 15:12:36 -08:00
d4c25add45 make sure the counter stays correct in between bailout transitions (#30186)
Summary:
This fixes the second issue reported in https://github.com/pytorch/pytorch/issues/29909 namely, a loop counter is assigned the wrong values after transitioning to a bailout graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30186

Differential Revision: D18646845

Pulled By: Krovatkin

fbshipit-source-id: 1f7c601dd9f35892979385ffa132fb0886a4f203
2019-12-03 14:59:08 -08:00
03a73cb9ac Remove namespace F = torch::nn::functional from torch/nn/modules/batchnorm.h (#30684)
Summary:
This PR removes `namespace F = torch::nn::functional` from `torch/nn/modules/batchnorm.h`, so that people don't have to define `torch::nn::functional` as `F` if they don't want to.

Fixes https://github.com/pytorch/pytorch/issues/30682.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30684

Differential Revision: D18795717

Pulled By: yf225

fbshipit-source-id: c9feffbeb632cc6b4ce3e6c22c0a78533bab69ad
2019-12-03 14:52:23 -08:00
604a27361f remove tuple_parser (#30659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30659

I could only find one usage of TupleParser and it doesn't seem worth maintaining just for that one usage.

Test Plan: Imported from OSS

Differential Revision: D18795979

Pulled By: nairbv

fbshipit-source-id: 6e50d65fc8fade0944f36ab20d00f1539a3d4cb8
2019-12-03 14:49:59 -08:00
4d4d8e0dce Update persons_of_interest.rst (#30647)
Summary:
Adding back the 3 names for the MSFT team - re: ONNX Governance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30647

Differential Revision: D18781163

Pulled By: jlin27

fbshipit-source-id: 7284ba29841ab41b9807c9d92694630b50de7b6a
2019-12-03 14:46:15 -08:00
4e6379379c fetch before checking out PR tip
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30680

Test Plan: Imported from OSS

Differential Revision: D18796189

Pulled By: suo

fbshipit-source-id: 99da48e5fd510ffdf4e606c2393eb55d4f6ca8d5
2019-12-03 14:43:19 -08:00
980aead1f8 Add support for quantized slice conversion (#30498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30498

Updated Int8SliceOp to accept dim, start, and end index, similar to PyTorch.

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps.test_slice

Imported from OSS

Differential Revision: D18740519

fbshipit-source-id: 2313f37a4936edb150ce04911b241e591e191801
2019-12-03 14:37:59 -08:00
bc2e6d10fa Back out "Revert D17908478: Switch PyTorch/Caffe2 to C++14"
Summary: Original commit changeset: 775d2e29be0b

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D18775520

fbshipit-source-id: a350b3f86b66d97241f208786ee67e9a51172eac
2019-12-03 14:33:43 -08:00
aff693ab1c Ensure MIOpen is called on same stream as operator for RNN (#30672)
Summary:
To ensure synchronization between the copying of weights into the RNN weight buffer and the operation itself, both the PyTorch operator and the underlying MIOpen call must be on the same HIP stream. This is also consistent with MIOpen calls in other PyTorch operators.

ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30672

Differential Revision: D18785683

Pulled By: bddppq

fbshipit-source-id: 144611046cb70cfe450680295734203f253ac6e2
2019-12-03 14:28:45 -08:00
40146eb48e Skip ProcessGroupGlooAsyncTest if there is no CUDA available (#30345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30345

Skip ProcessGroupGlooAsyncTest if CUDA is not available; otherwise, on sandcastle non-GPU hosts the test will abort because it fails to load the CUDA library
ghstack-source-id: 94771241

Test Plan: test skipped on non GPU host

Differential Revision: D18665322

fbshipit-source-id: 8c7b89aeecc6ec007bee12d864a6058384254e61
2019-12-03 13:27:34 -08:00
19cd90d303 Globally record observer nodes (#30547)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30547

att

Test Plan:
test_jit.py test_quantization.py

Imported from OSS

Differential Revision: D18784752

fbshipit-source-id: 000e140aa86ff12a240d98da71871a5a5053401f
2019-12-03 12:16:00 -08:00
1b5ce05924 don't use size()/stride() functions in TensorImpl, use size_[d]/stride_[d] instead (#30452)
Summary:
This improved the multi-d microbenchmark by ~100 ns; empty_tensor_restride used to be 13% of iteration time and is now about 5%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30452

Test Plan: Covered by existing tests

Differential Revision: D18704233

Pulled By: ngimel

fbshipit-source-id: be527f09183bc31e9d1f63fd49bfbe0998fe167f
2019-12-03 11:38:07 -08:00
7023e13fbb Fix mapping white list (#30636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30636

Currently DeQuantStub is still in the whitelist because set union has
lower precedence than set difference.
Fixes issue: https://github.com/pytorch/pytorch/issues/29646
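
The precedence pitfall in plain Python sets (illustrative, not the actual qconfig code):

```
a = {1, 2}
b = {2, 3}
c = {2}

# `-` binds tighter than `|`, so this parses as a | (b - c).
print(a | b - c)    # {1, 2, 3} -- c was only subtracted from b
print((a | b) - c)  # {1, 3}    -- what was actually intended
```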

Test Plan:
verified locally that we don't attach qconfig for DeQuantStub

Imported from OSS

Differential Revision: D18775275

fbshipit-source-id: 8da07e40963555671b3d4326c9291706103f858e
2019-12-03 11:34:28 -08:00
f114c33e69 Fix iOS CI (#30327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30327

### Summary

Seems like starting from macOS 10.15, we can no longer get access to the `Downloads` folder on our macOS machines.

```
PermissionError: [Errno 1] Operation not permitted: '/Users/distiller/Downloads'
```

The fix is to change the conda download directory to ${HOME}

### Test Plan

- iOS jobs are back to normal
- Don't break other jobs

Test Plan: Imported from OSS

Differential Revision: D18717380

Pulled By: xta0

fbshipit-source-id: cad754076bf4ae5035741aa57a310ad87c76726e
2019-12-03 11:24:21 -08:00
1b12fd33ed Add missing trigamma_stub definition. (#30314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30314

Somehow we forgot to define it!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18762356

Pulled By: ezyang

fbshipit-source-id: 28afc605ad986266071e3831049ec8a7f71fd695
2019-12-03 10:46:52 -08:00
a009fc14be Workaround hcc bug regarding extern "C" definitions (#30313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30313

See comments in code about the bug.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18762360

Pulled By: ezyang

fbshipit-source-id: 406a01f2f0c3722b381428c89afd67b3c3c19142
2019-12-03 10:46:48 -08:00
8269f7b652 Delete redundant THC_API on THCStorage_new (#30312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30312

It's not necessary because it's already defined in the header.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18762363

Pulled By: ezyang

fbshipit-source-id: 418bf355d460dd171ac449559f20bf55415e54ae
2019-12-03 10:46:43 -08:00
d43e205026 Properly include declaration of dispatch in file that registers it. (#30311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30311

multinomial_stub must be in scope to register against it.  Somehow,
this works today, but when I split torch_cpu and torch_cuda it
doesn't.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18762358

Pulled By: ezyang

fbshipit-source-id: ef9c111292cd02d816af1c94c8bbaadabffaabe5
2019-12-03 10:46:38 -08:00
a5b1f6e7d7 Add missing _API definitions. (#30310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30310

- Annotate CUDAGenerator.h with correct TORCH_CUDA_API.
  This is actually CUDA related functionality with its implementation living
  in the cuda/ folder.  For some reason it lives at the top level; it
  should be moved (but that should be handled in another PR.)
- Add missing TORCH/CAFFE_API annotations. All of
  these functions are used from CUDA code, which means that
  we need to correctly annotate them if we split CPU/CUDA code
  into separate libraries.

Test Plan: Imported from OSS

Differential Revision: D18762357

Pulled By: ezyang

fbshipit-source-id: c975a8e4f082fe9f4196c2cca40977623caf4148
2019-12-03 10:46:32 -08:00
08394cede3 DEFINE_DISPATCH in the correct namespace. (#30308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30308

Dispatch is declared in a non-anonymous namespace, so it definitely
shouldn't be defined in an anonymous namespace.  This doesn't seem
to matter today, but it matters when we split libtorch into two
libraries.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18762361

Pulled By: ezyang

fbshipit-source-id: 484f0fab183c385dd889db9dad3e48e92e0a3900
2019-12-03 10:46:27 -08:00
9740011f10 Use normal dispatch to get to CUDA threshold kernels, instead of DispatchStub. (#30307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30307

DispatchStub will stop working when I split CPU/CUDA libraries, because
there are some symbols from the templates in DispatchStub stubs which aren't
properly exported and I couldn't figure out how to make them dispatch properly.

This is the only case where DispatchStub is being used to dispatch to CUDA,
anyway.

This partially addresses #29844 but I need to also just completely delete
the CUDA registration logic from DispatchStub entirely.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18762362

Pulled By: ezyang

fbshipit-source-id: bdfa8739c0daf23badf3c5af61890a934af00813
2019-12-03 10:46:22 -08:00
a997f224ac Add torch.multiprocessing.create_processes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28493

Differential Revision: D18766066

Pulled By: ailzhang

fbshipit-source-id: 7f424c8fae3012be2416cf9bc72ee2dde40c1f89
2019-12-03 10:38:19 -08:00
4d30415f12 Add ONNX Scripting Conv Support (#30618)
Summary:
Convolution nodes are traced as aten::_convolution and are currently supported in ONNX.
Scripting a convolution uses aten::conv<1,2,3>d, which is currently not supported in ONNX.
This PR adds the symbolics for aten::conv<1,2,3>d and aten::conv_transpose<1,2,3>d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30618

Reviewed By: hl475

Differential Revision: D18778145

Pulled By: houseroad

fbshipit-source-id: 4af0379f29974a1ce8443024d1d87b3eb8d2dd36
2019-12-03 10:28:38 -08:00
89be1a22d4 split getInvokedMethods (#30546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30546

factor out this function for later support of quantizing shared types

Test Plan:
test_jit.py, test_quantization.py

Imported from OSS

Differential Revision: D18776304

fbshipit-source-id: f5a736b0f69019cefe17ec4517da1ae5462f78e1
2019-12-03 10:11:57 -08:00
d5c136097a improve .view() performance (#30554)
Summary:
Improve .view() performance by not calling set_ and instead restriding the returned alias. This improves the performance of the .view() operation from ~500 ns to ~360 ns.
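
For context, .view() returns an alias that shares storage with the original tensor, which is why restriding suffices; a quick illustration:

```
import torch

x = torch.arange(6)
y = x.view(2, 3)  # no data copy: y aliases x with new sizes/strides
y[0, 0] = 42
print(x[0])       # tensor(42) -- same underlying storage
```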
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30554

Test Plan: covered by existing tests

Differential Revision: D18759896

Pulled By: ngimel

fbshipit-source-id: 9757c93158bc55e9c87dc30ac3415ba8f8b849e5
2019-12-03 09:17:43 -08:00
5a484245d9 Change test_invalid_names test to only test constructor of WorkerInfo (#30620)
Summary:
This test seems to only check that we throw exceptions in the `WorkerInfo` constructor when invalid names are passed in, so I don't think we need to complicate it by initializing RPC and exposing ourselves to potential flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30620

Differential Revision: D18766955

Pulled By: rohan-varma

fbshipit-source-id: 11643de4d57431e5f46e096c7766de3ab0b9b05a
2019-12-03 09:07:10 -08:00
f9f54201d3 Remove deprecated fromIvalue in RRefForkData
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30646

Test Plan: Imported from OSS

Differential Revision: D18777610

Pulled By: mrshenli

fbshipit-source-id: 7a749c1035e36bbb464332d3829fd53e2c6cf727
2019-12-03 09:01:40 -08:00
b446572997 TestCppExtension now removes /tmp/torch_extensions folder so that it can be used by other users in a multi-user environment. (#30095)
Summary:
Previous behavior: a user runs tests from the `TestCppExtension` class so that `/tmp/torch_extensions` is created under their ownership and not removed afterwards;
another user's run of the same tests might then result in a 'Permission denied' exception upon deleting `/tmp/torch_extensions`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30095

Differential Revision: D18770234

Pulled By: ezyang

fbshipit-source-id: 4c6b972e4c4327a94c8b4bf6b0b9998a01c218bb
2019-12-03 07:44:27 -08:00
8b29701ae5 Turn off scalar_checks for _th_reciprocal. (#30436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30436

The underlying TH implementation is correct.

Test Plan: Imported from OSS

Differential Revision: D18699088

Pulled By: gchanan

fbshipit-source-id: e75a588ae4afb0506922ba98208546d5c0de623a
2019-12-03 07:04:53 -08:00
61798865e3 Turn off scalar_checks for torch.clamp. (#30435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30435

The underlying THC implementations are correct.

Test Plan: Imported from OSS

Differential Revision: D18699089

Pulled By: gchanan

fbshipit-source-id: f5d1319bf48eae36903296dad0b98ed80661f732
2019-12-03 07:04:47 -08:00
e5b947a3a8 Raise an error for is_signed on quantized types (#30527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30527

When we introduced dtype.is_signed we allowed for support of
quantized types, but we're not sure what the correct result should be.

See discussion at https://github.com/pytorch/pytorch/pull/29511

Test Plan: Imported from OSS

Differential Revision: D18765410

Pulled By: nairbv

fbshipit-source-id: c87cfe999b604cfcbbafa561e04d0d5cdbf41e6d
2019-12-03 06:34:53 -08:00
18ec4632b3 Exclude undefined tensors in the result of Module::parameters() / named_paramters() / buffers() / named_buffers() (#30626)
Summary:
PR https://github.com/pytorch/pytorch/pull/30523 attempted to fix https://github.com/pytorch/pytorch/issues/30508 and https://github.com/pytorch/pytorch/issues/30462, but the fix wasn't complete. This PR makes the following improvements:
1. Fixes https://github.com/pytorch/pytorch/issues/30508 and https://github.com/pytorch/pytorch/issues/30462 properly by excluding undefined tensors in the result of `Module::parameters()` / `named_parameters()` / `buffers()` / `named_buffers()`, which mirrors the Python API behavior.
2. Audits all use sites of `Module::parameters_` / `buffers_` and change them to `Module::named_parameters(/*recurse=*/false)` / `named_buffers(/*recurse=*/false)` when appropriate, so that use sites of module parameters / buffers never need to worry about undefined tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30626

Differential Revision: D18777507

Pulled By: yf225

fbshipit-source-id: 55b64b69779e1186342efd3c44857f416334ed6b
2019-12-02 21:59:58 -08:00
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
0bebfe2143 Add the explicit per-tensor/per-channel quant info when we print the module (#30591)
Summary:
As the title says. We would like to explicitly distinguish the per-tensor/per-channel scheme when we print the module.

Here is an example for Lenet after applying the per-channel dynamic quantization:

Before this PR:
```
FloatModel(
  (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1))
  (fc1): DynamicQuantizedLinear(
    in_features=800, out_features=500
    (_packed_params): LinearPackedParams()
  )
  (fc2): DynamicQuantizedLinear(
    in_features=500, out_features=10
    (_packed_params): LinearPackedParams()
  )
)
```

After this PR:
```
FloatModel(
  (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1))
  (fc1): DynamicQuantizedLinear(
    in_features=800, out_features=500, qscheme=torch.per_channel_affine
    (_packed_params): LinearPackedParams()
  )
  (fc2): DynamicQuantizedLinear(
    in_features=500, out_features=10, qscheme=torch.per_channel_affine
    (_packed_params): LinearPackedParams()
  )
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30591

Differential Revision: D18764366

Pulled By: jianyuh

fbshipit-source-id: e897ab42ace6b82b2a90729ba788313c7873de1a
2019-12-02 20:14:46 -08:00
4dab29a2bd Fix serialization memory lifetime issue. (#30603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30603

The Pickler object needs to be kept in scope until the data is written out to the
final serialized string. tensorData in particular is a reference to memory
owned by the descoped Pickler object.

Noticed this by inspection. In practice, this potential read-after-free
is limited to non-CPU tensors, and any such use happened very soon after the free.
ghstack-source-id: 94756036

Test Plan: existing test suite at buck test mode/dev-nosan caffe2/test:rpc_fork

Differential Revision: D18760463

fbshipit-source-id: 9de890d66626aa48f13ca376dd9bd50b92e0cb00
2019-12-02 20:10:28 -08:00
db81e13d6b Fix TCPStoreTest and improve tcputils::connect() (#30354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30354

TCPStoreTest would time out since the TCPStore constructor for the
server would block the main thread waiting for workers. The workers themselves
were spawned later on once the server store is created. As a result, this test
would always time out.

To fix the test, I moved the server store to a thread so that the workers can
register with the server in parallel.

In addition to this, I made a few improvements to tcputils::connect. When
tcputils::connect() encountered an exception, it always looked at `errno` for
the error code. In some cases `errno` could be overwritten and the real error
code would be stored in `std::system_error`. As a result, I've modified the
code to look at the error code in `std::system_error` if we catch an exception
of that type.
ghstack-source-id: 94758939

Test Plan: waitforbuildbot

Differential Revision: D18668454

fbshipit-source-id: d5a3c57b066b094bfecda9a79d9d31bfa32e17f0
2019-12-02 19:52:34 -08:00
9e3d19412b Disable implicit conversion warning (#30529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30529

We started to see build failures for multiple services with the top-of-trunk LLVM compiler. The failures point to a warning that was treated as an error for an implicit conversion from long to double. Per discussion on D18642524, I'm disabling this warning from the containing TARGETS file. T58053069 was opened for the code owner to track this - a proper source code fix and more unit tests are needed.

Test Plan: local build, sandcastle

Reviewed By: smessmer

Differential Revision: D18668396

fbshipit-source-id: 28c0ff3258c5ba3afd41a0053f9fe1b356a496a8
2019-12-02 18:30:03 -08:00
968c0d4a46 Add support for converting quantized AvgPool2d and Reshape operations (#30490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30490

Add symbolic mapping to Int8AvgPool2d and Int8Reshape op in C2

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps

Imported from OSS

Differential Revision: D18740520

fbshipit-source-id: 1606125500c4b549fbc984e7929b7fd5204396a0
2019-12-02 18:15:01 -08:00
2d0a4e42e9 Add barriers to fix flaky test_graph_for_py_nested_call and (#30624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30624

These tests were flaky since we would end up calling the 'verify'
methods before some of the RPCs were done. The `check_rpc_done` function might
not guarantee this since set_rpc_done sets an appropriate flag in Python, which
causes `check_rpc_done` to pass. However, there are a few steps after that,
like attaching the send functions for the response of the RPC, that might not
have executed by then.
ghstack-source-id: 94781954

Test Plan: Run the tests 100 times.

Reviewed By: zhaojuanmao

Differential Revision: D18768786

fbshipit-source-id: a14c3f4b27de14fe5ecc6e90854dc52652f769b8
2019-12-02 18:12:28 -08:00
98ab55fc51 PRAGMA missing for clang (#30351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30351

Not sure what the proper fix is; clang is having trouble with the loop pragmas. This at least gets things compiling.
ghstack-source-id: 94458450

Test Plan: CI passes

Differential Revision: D18665812

fbshipit-source-id: b8a899ce4138010cbe308eaa2c0838dd9e15573f
2019-12-02 17:50:22 -08:00
9c02b88791 Add pickler support for Device (#30131)
Summary:
This PR adds (un)pickling support for `c10::Device`. It also adds `torch.device` as a type annotation for device attributes.
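
A minimal sketch of the annotation side of this (the module below is illustrative):

```
import torch

class M(torch.nn.Module):
    device: torch.device  # torch.device usable as an attribute type annotation

    def __init__(self):
        super().__init__()
        self.device = torch.device("cpu")

    def forward(self, x):
        return x.to(self.device)

scripted = torch.jit.script(M())  # the device attribute now survives (un)pickling
```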
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30131

Pulled By: driazati

Differential Revision: D18664421

fbshipit-source-id: 64378fb42b2d1bbe2bd86259e5ed10f24b5d1e49
2019-12-02 17:43:08 -08:00
19b7d49fac Add TOC to CONTRIBUTING.md (#29671)
Summary:
This TOC is manually generated, but `CONTRIBUTING.md` seems stable
enough for that to be okay.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29671

Pulled By: driazati

Differential Revision: D18771604

fbshipit-source-id: 0d6c9c6cf1083d3be413219d3cead79c2fe5050b
2019-12-02 16:47:59 -08:00
569729527b Turn off scalar_checks for exp, cos, cosh, tan, atan, tanh, erf, erfc. (#30434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30434

These are all pointwise ops that are implemented correctly wrt shapes in THC.

Test Plan: Imported from OSS

Differential Revision: D18699087

Pulled By: gchanan

fbshipit-source-id: 82cb91b00c77bfaca75be497c87fc7ae52daf46c
2019-12-02 16:10:25 -08:00
9082123038 Back out "Back out "Revert D18542342: Boxed variable dispatch""
Summary: Original commit changeset: 7f3e32a6ee0c

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D18766763

fbshipit-source-id: 51bb7aac7cb7ce3df94681e838949e7a156e3ad9
2019-12-02 16:06:36 -08:00
3636cb0364 windows build (#30556)
Summary:
based on https://github.com/pytorch/pytorch/pull/28677
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30556

Differential Revision: D18764040

Pulled By: mingbowan

fbshipit-source-id: 53104636800f5887b74a82c154bc5e9603de9322
2019-12-02 14:54:22 -08:00
d32f261f16 make the order btw div and mul in adagrad update consistent (#30449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30449

There was an inconsistency in the order of operations between the scalar and SIMD code when we compute Adagrad.
In this diff we first compute effective_lr = lr / (sqrt(moment) + epsilon) and then multiply by the gradient.
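
A minimal NumPy sketch of the consistent update order described above (names are illustrative):

```
import numpy as np

def adagrad_update(param, grad, moment, lr=0.01, epsilon=1e-8):
    moment += grad * grad                            # accumulate squared gradients
    effective_lr = lr / (np.sqrt(moment) + epsilon)  # divide first...
    param -= effective_lr * grad                     # ...then multiply by the gradient
    return param, moment
```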

Test Plan: CI

Reviewed By: protonu

Differential Revision: D18703416

fbshipit-source-id: 2a8b2a3f5401466549561412bd22f07abac3c598
2019-12-02 13:53:38 -08:00
1111a6b810 Use pybind11::gil_scoped_* functions instead of AutoGIL/AutoNoGIL (#30274)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/29095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30274

Differential Revision: D18762293

Pulled By: ezyang

fbshipit-source-id: d3d50c2dd12bcb678ab25fa708eb6587cc4b66f9
2019-12-02 12:19:58 -08:00
6deb41c88d Update magma to 2.5.1 for Windows and switch CUDA in CI to 9.2
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30513

Differential Revision: D18764184

Pulled By: ezyang

fbshipit-source-id: 4992869fd6a89471a5d25eb6a9b44ad8eceb480f
2019-12-02 11:56:10 -08:00
b68d1fc316 add small input shapes to some ops (#30617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30617

as title

Test Plan: buck run //caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --operator add,as_strided,cat,chunk,fill,linear,matmul,split

Reviewed By: hl475

Differential Revision: D18764248

fbshipit-source-id: 510cf83542822acfa1b7b5e475b0cc7432f7ac19
2019-12-02 10:46:43 -08:00
8ee61e0be4 Fix CPU_INTEL flag error on windows (#30564)
Summary:
${CMAKE_HOST_SYSTEM_PROCESSOR} gets the processor name via `uname -p` on Linux and `%PROCESSOR_ARCHITECTURE%` on Windows.
1. %PROCESSOR_ARCHITECTURE% has a value in (AMD64|IA64|ARM64) for 64-bit processors, and (x86) for 32-bit processors
2. `uname -p` has a value like "(x86_64|i[3-6]+86)"
We cannot tell an Intel CPU from other CPUs by ${CMAKE_HOST_SYSTEM_PROCESSOR}; it is the architecture, not the vendor.
e.g., an Intel i7-9700K CPU on Windows reports "AMD64"

reference:
[MSDN](https://docs.microsoft.com/zh-cn/windows/win32/winprog64/wow64-implementation-details?redirectedfrom=MSDN)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30564

Differential Revision: D18763031

Pulled By: ezyang

fbshipit-source-id: 11ae20e66b4b89bde1dcf4df6177606a3374c671
2019-12-02 08:43:01 -08:00
e6000a7c04 Temporarily disable test_numerical_consistency_per_tensor (#30600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30600

test_numerical_consistency_per_tensor in test_fake_quant is failing on Windows.
ghstack-source-id: 94742124

Test Plan: CircleCI tests

Differential Revision: D18760287

fbshipit-source-id: 7f59355eab74e811bb370ad2836ed2f1def1f621
2019-12-02 06:57:14 -08:00
c780610f2d Disable test_backward_per_tensor in test_fake_quant (#30594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30594

This test case started breaking; cleaning it up for the build.
ghstack-source-id: 94736837

Test Plan: Unittest disabling change

Differential Revision: D18758635

fbshipit-source-id: 05df1158ff0ccd75e401f352da529fb663b1cae0
2019-12-01 22:26:28 -08:00
53785771a7 Don't build test_cpp_rpc if torch is built without distributed support (#30587)
Summary:
On the latest master, I get link errors when building one of the tests:

```sh
/home/pbell/git/pytorch/build/../test/cpp/rpc/test_wire_serialization.cpp:23:
undefined reference to `torch::distributed::rpc::wireDeserialize(void const*, unsigned long)'
```

This seems to be caused by PR https://github.com/pytorch/pytorch/issues/29785 not working with `USE_DISTRIBUTED=0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30587

Differential Revision: D18758625

Pulled By: jjlilley

fbshipit-source-id: 0ad0703acdbbac22bb4b8317370fbe2606fcb67e
2019-12-01 16:43:12 -08:00
dd52f50fc8 Add examples to RRef doc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30516

Test Plan: Imported from OSS

Differential Revision: D18728183

Pulled By: mrshenli

fbshipit-source-id: af472ebed0e6dd0a85653b080abd3ac4d482bd26
2019-11-28 15:34:26 -08:00
30d70d5378 Make doc source format consistent in rpc/init.cpp
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30515

Test Plan: Imported from OSS

Differential Revision: D18728184

Pulled By: mrshenli

fbshipit-source-id: 7b643c7f8225943113fbd7130ff6aadb30c1d4e9
2019-11-28 15:34:22 -08:00
ec5e471647 Reorganize rpc API doc and add introduction (#30491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30491

Our RPC API docs present the APIs well but miss a general
introduction to the APIs. Readers might be a little lost the first
time they land on this page. This commit reorganizes the APIs into
four components from the user's perspective: RPC, RRef, dist autograd,
and dist optimizer. It also adds an intro to each and briefly
describes why we provide them.

Test Plan: Imported from OSS

Differential Revision: D18723294

Pulled By: mrshenli

fbshipit-source-id: 4aced4ab537b070aa780aaaf9724659fd47cb3cb
2019-11-28 15:34:18 -08:00
f4e7e9039d Improve process_group_agent() serialization speed (#29785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29785

TLDR: This change improves process_group's serialization speed:
  Serialize_Tensor64:     12.38us ->   1.99us  (~-84%)
  Deserialize_Tensor64:   33.89us ->   5.62us  (~-84%)
  Serialize_Tensor1M:    525.74us -> 285.43us  (~-45%)
  Deserialize_Tensor1M:  892.61us -> 273.68us  (~-70%)

After speaking with the jit team, we had consensus that torch::save()/load()
are somewhat high-overhead for RPC serialization, mostly intended for
persistent disk data.

(Particularly, for large tensors, 35% of the time is spent in CRC checking, even
with the fb-side changes to substitute 40x faster SSE-accelerated CRC checking;
also, for small tensors, the zip container overhead is considerable, as is the
overhead of lexing/parsing an embedded text python program for each RPC.)

The jit team encouraged us to use jit::pickler, with the WriteableTensorData
way of outputting result tensors (not the default side-tensor table, or
with pickling the actual tensors). This ends up just pickling some tensor
metadata, and giving us some tensor blobs that we can mindlessly
blit over the wire (they copy to CPU memory if needed).

There is yet no standardized container format for the pickled data
(there is jit::pickle_save() checked in, but it's experimental;
no load function is yet provided), but they encouraged us to just use
something sensible for this, and possibly revisit later. For now, I made
the directory headers slightly http-inspired.

Note that serialization is just one component of the pipeline, but that
said, we also see reasonable reductions in end-to-end echo times (noisier):
   ProcessGroupAgent_Echo(Tensor_Small)   855.25us -> 492.65us  (~-42%)
   ProcessGroupAgent_Echo(Tensor_1M)       10.82ms -> 6.94ms    (~-35%)
   ProcessGroupAgent_Echo(Small_NoTensor) 688.82us -> 301.72us  (~-56%)
   ProcessGroupAgent_Echo(1MB_NoTensor)     4.65ms -> 3.71ms    (~-20%)

I moved the "wire serialization" logic to a separate file to assist with
unittesting.
ghstack-source-id: 94694682

Test Plan:
buck test mode/dev-nosan caffe2/test/cpp/api:serialize
  buck test mode/dev-nosan caffe2/test/...

Differential Revision: D18493938

fbshipit-source-id: 07ddfe87dbe56472bc944f7d070627052c94a8f4
2019-11-28 09:57:52 -08:00
1350b99de4 Add local shutdown to process group agent (#30330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884
ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
2019-11-27 22:34:08 -08:00
7ac8efa689 Skip undefined tensors when moving torch::nn module to a different device (#30523)
Summary:
This fixes high-pri issues such as https://github.com/pytorch/pytorch/issues/30508 and https://github.com/pytorch/pytorch/issues/30462.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30523

Differential Revision: D18732904

Pulled By: yf225

fbshipit-source-id: fe5a7a43838000f5803bd9c01ecfba0c3f02df5d
2019-11-27 21:21:02 -08:00
640109ae5d Back out "Revert D18542342: Boxed variable dispatch"
Summary: Original commit changeset: 082992125447

Test Plan: waitforsandcastle

Reviewed By: akinh

Differential Revision: D18737627

fbshipit-source-id: 7f3e32a6ee0c330002ae7fdcc8a35e8b540bb4db
2019-11-27 17:39:09 -08:00
87f29557bd Ignore logical_and and logical_or in op BC check for now (#30537)
Summary:
Make the CI happy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30537

Reviewed By: hl475

Differential Revision: D18738567

Pulled By: houseroad

fbshipit-source-id: f30a87e22653b83ebdb1b54851460ec245866ecf
2019-11-27 16:59:37 -08:00
a2ed50c920 Revert D17908478: Switch PyTorch/Caffe2 to C++14
Test Plan: revert-hammer

Differential Revision:
D17908478

Original commit changeset: 6e340024591e

fbshipit-source-id: 775d2e29be0bc3a0db64f164c8960c44d4877d5d
2019-11-27 14:57:05 -08:00
0b25371f5d Turn off scalar_check for _th_normal.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29955

Test Plan: Imported from OSS

Differential Revision: D18548051

Pulled By: gchanan

fbshipit-source-id: c652999ac9e37d2592aa85ef022040fe0700b5cf
2019-11-27 14:52:06 -08:00
f3631c2464 Revert D18542342: Boxed variable dispatch
Test Plan: revert-hammer

Differential Revision:
D18542342

Original commit changeset: a30ae35d98f8

fbshipit-source-id: 082992125447c814c90f7934fadf00995e146e0e
2019-11-27 14:01:40 -08:00
7d2b0aa693 add retries to network operations (curl, conda install, git clone) (#30479)
Summary:
Addresses some of the top network-related flakiness occurrences.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30479

Differential Revision: D18736386

Pulled By: kostmo

fbshipit-source-id: 9eb5dca0cd0281894a0b304fbaf59a0341d3ff58
2019-11-27 13:58:15 -08:00
c1c5622a6a Add katex to pytorch-linux-xenial-py3.6-gcc5.4 docker image (#30522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30522

This is in preparation for moving the docs push CI jobs to depend on
`pytorch-linux-xenial-py3.6-gcc5.4` rather than
`pytorch-linux-xenial-cuda9-cudnn7-py3`.

Test Plan: Imported from OSS

Differential Revision: D18731108

Pulled By: zou3519

fbshipit-source-id: fd753a5ca818fa73a14e4276c33368a247cc40e1
2019-11-27 12:41:58 -08:00
a69be8123a Use gettimeofday on iOS (#30361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30361

### Summary

By default, the compiler will choose `clock_gettime` for the iOS build. However, that API is not available until iOS 10. Since the Facebook app still supports iOS 9.0,  we have to use `gettimeofday` instead.

```shell
xplat/caffe2/torch/csrc/autograd/profiler.h:86:3: error: 'clock_gettime' is only available on iOS 10.0 or newer [-Werror,-Wunguarded-availability]

xplat/caffe2/torch/csrc/autograd/profiler.h:86:17: error: '_CLOCK_MONOTONIC' is only available on iOS 10.0 or newer [-Werror,-Wunguarded-availability]
```

P.S. the open-sourced version is iOS 12.0 and above, so we don't have this problem.

### Test Plan

- buck build works
- Don't break CIs

Test Plan: Imported from OSS

Differential Revision: D18730262

Pulled By: xta0

fbshipit-source-id: fe6d954b8d3c23cbc9d1e25a2e72e0b0c1d4eaa9
2019-11-27 11:48:41 -08:00
2f42488d36 Updating submodules
Summary:
GitHub commits:

64dc8e79e9
3b2aa3c218
dc6c17ca9e
4508ea4e06
6150034ff3
12b7a89a4b
9befbe9b40
2fd96cc070
68bf04ce46
19bd96d453
7229ad4fd7
b2bb2b465b
4c65c9023d

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: e7dc6a4ebafdc6a01aff89f4038f5679ed6e7011
2019-11-27 11:44:54 -08:00
106ab487eb fix typo in doc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30518

Differential Revision: D18729361

Pulled By: albanD

fbshipit-source-id: 4e386b99e898b9cd8f9a21dff642d0f40355899f
2019-11-27 11:19:13 -08:00
fcb7371e65 Update docs for cpp_extension on Windows (#30392)
Summary:
Targets https://github.com/pytorch/pytorch/issues/30379.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30392

Differential Revision: D18730438

Pulled By: albanD

fbshipit-source-id: f718d006ee8aaaa356c1e15e53a0469f15e8ed41
2019-11-27 10:56:29 -08:00
d0acc9c085 Switch PyTorch/Caffe2 to C++14 (#30406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30406

ghstack-source-id: 94642238

Test Plan: waitforsandcastle

Differential Revision: D17908478

fbshipit-source-id: 6e340024591ec2c69521668022999df4a33b4ddb
2019-11-27 10:47:31 -08:00
ec5c08de74 Revert D18580867: Add logical_and and logical_or
Test Plan: revert-hammer

Differential Revision:
D18580867

Original commit changeset: 7e4d7c37da4d

fbshipit-source-id: 81fb604c7aef8d847f518f5faa016e7bd0423016
2019-11-27 09:27:00 -08:00
1e8ed021c6 Support logsoftmax with dim != -1 (#30433)
Summary:
PyTorch dim and ONNX axis have different meanings.
ONNX only supports log_softmax with dim = -1, so transposes must be added before and after log_softmax to support the other cases.
This requires the input rank to be known at export time.
Fixes https://github.com/pytorch/pytorch/issues/17918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30433
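
As a sketch of the idea (plain PyTorch here, not the exporter's actual symbolic function): transpose the target dim to the last position, apply log_softmax over the last dim, then transpose back.

```python
import torch
import torch.nn.functional as F

def log_softmax_last_dim_only(x, dim):
    # Move `dim` to the end, apply log_softmax over the last dim
    # (the only dim ONNX supports), then move it back.
    x = x.transpose(dim, -1)
    y = F.log_softmax(x, dim=-1)
    return y.transpose(dim, -1)

x = torch.randn(2, 3, 4)
assert torch.allclose(log_softmax_last_dim_only(x, 1), F.log_softmax(x, dim=1))
```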

Reviewed By: hl475

Differential Revision: D18723520

Pulled By: houseroad

fbshipit-source-id: d0ed3b3f051d08d46495a7abfa854edd120dca3a
2019-11-27 08:34:38 -08:00
0282c5ae69 Add helper to aggregate multiple process groups (#25768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25768

The round robin process group can be constructed from multiple other
process groups. Every collective call against this new process group
is delegated to the specified process groups in a round robin fashion.

Doing so may benefit performance when calling into multiple NCCL
process groups. Instead of adding support for round-robin usage of
NCCL communicators, we achieve the same by adding this wrapper class,
without changing the NCCL process group.

The API to create this round robin process group is a bit harsh. If we
find it adds significant benefit we can revisit and make this a first
class citizen in the torch.distributed module.
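
A minimal sketch of the delegation pattern (hypothetical class and method names, not the torch.distributed API):

```python
import itertools

class RoundRobinProcessGroup:
    """Delegates each collective to one of the wrapped groups in turn."""

    def __init__(self, process_groups):
        self._cycle = itertools.cycle(process_groups)

    def allreduce(self, tensors):
        # Each call lands on the next underlying process group.
        return next(self._cycle).allreduce(tensors)
```
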
ghstack-source-id: 94578376

Test Plan: The newly added test passes.

Reviewed By: chenyangyu1988

Differential Revision: D17226323

fbshipit-source-id: ec9f754b66f33b983fee30bfb86a1c4c5d74767d
2019-11-27 08:34:34 -08:00
1d3f3a1a0c Add pybind11 trampoline class for c10d.Store (#30415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30415

This enables subclassing of c10d.Store and implementing its interface in Python.
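
A hedged sketch of what this enables, assuming the base class is exposed as torch.distributed.Store (method names follow the Store interface; exact signatures may differ by version):

```python
import torch.distributed as dist

class DictStore(dist.Store):
    """A Store backed by a plain Python dict."""

    def __init__(self):
        super().__init__()
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]
```
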
ghstack-source-id: 94586627

Test Plan: New tests passes.

Reviewed By: vladbelous

Differential Revision: D18693018

fbshipit-source-id: fa1eba4bd11cc09a3d6bf3f35369c885033c63c0
2019-11-27 08:34:29 -08:00
d2336edcfb Boxed variable dispatch (#29934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29934

Previously, when doing boxed dispatch (e.g. custom ops), the dispatcher manually removed the VariableTensorId flag before dispatching
because custom ops don't have variable kernels.
This is one of the blockers that prevented us from using the boxed dispatch mechanism for ops from native_functions.yaml because they define variable kernels and need them to be called for autograd.

This PR changes that. The dispatcher doesn't remove the VariableTensorId flag anymore.
Instead, to make custom ops work, we implement a variable fallback kernel that is called whenever no other variable kernel was found.
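
A minimal sketch of the fallback idea in plain Python (hypothetical names and dispatch table, not the c10 dispatcher API):

```python
# Ops register kernels per dispatch key; no variable kernels here.
kernels = {("custom::myop", "CPU"): lambda x, y: x + y}

def dispatch(op, key, *args):
    fn = kernels.get((op, key))
    if fn is None and key == "Variable":
        # Variable fallback: no variable kernel was registered, so fall
        # through to the backend kernel instead of erroring out.
        return dispatch(op, "CPU", *args)
    return fn(*args)

assert dispatch("custom::myop", "Variable", 1, 2) == 3
```
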
ghstack-source-id: 94618474

Test Plan: unit tests

Differential Revision: D18542342

fbshipit-source-id: a30ae35d98f89f7ae507151f55c42cfbed54a451
2019-11-27 08:34:25 -08:00
512c2a2df5 Enable constant folding (#29834)
Summary:
Set default do_constant_folding = True
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29834

Reviewed By: hl475

Differential Revision: D18588037

Pulled By: houseroad

fbshipit-source-id: b35c06161321629c886e177ea666eff31cebf06a
2019-11-27 08:34:20 -08:00
c1c8105de0 Make the warning of using SparseTensor in JIT less noisy
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30499

Test Plan: waitforsandcastle

Reviewed By: wanchaol

Differential Revision: D18705553

fbshipit-source-id: d6e16e3285a74a1c031a5312f7a690f1baf392f8
2019-11-27 08:34:16 -08:00
829499e626 avoid Formatting::print() when STRIP_ERROR_MESSAGES is set (#30451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30451

TORCH_CHECK takes __VA_ARGS__ so there is no need to concatenate strings
before calling it. This way it won't call Formatting::print() on the
tensor when the STRIP_ERROR_MESSAGES macro is set. Formatting::print() calls
several specific tensor methods that bring in unnecessary inter-op
dependencies for static code analysis.

Test Plan: - builds

Differential Revision: D18703784

Pulled By: ljk53

fbshipit-source-id: 1c0628e3ddcb2fd42c475cb161edbef09dfe8eb5
2019-11-26 17:38:45 -08:00
2d6b2f39e9 Fix docs so that the example works (#30120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30120

The example given for functional conv2d didn't work. This diff fixes the example in docs so that it works.

Fixes https://github.com/pytorch/pytorch/issues/29649
ghstack-source-id: 94601559

Test Plan: Tried the example locally

Differential Revision: D18604606

fbshipit-source-id: ff1a4f903e2843efe30d962d4ff00e5065cd1d7e
2019-11-26 17:38:40 -08:00
5ada5363fc GenericDict/List type use unshapedType() (#30428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30428

Reported issue https://discuss.pytorch.org/t/incomprehensible-behaviour/61710

Steps to reproduce:

```
class WrapRPN(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, features):
        # type: (Dict[str, Tensor]) -> int
        return 0
```

```
#include <torch/script.h>

int main() {
  torch::jit::script::Module module = torch::jit::load("dict_str_tensor.pt");

  torch::Tensor tensor = torch::rand({2, 3});
  at::IValue ivalue{tensor};
  c10::impl::GenericDict dict{c10::StringType::get(),ivalue.type()};
  dict.insert("key", ivalue);
  module.forward({dict});
}
```

The ValueType of `c10::impl::GenericDict` is taken from the first specified element, here `ivalue.type()`.
The call then fails the type check `!value.type()->isSubtypeOf(argument.type())` in `function_schema_inl.h`,
as `DictType::isSubtypeOf` requires equal KeyType and ValueType, while the `TensorType`s differ.

Fix:
Use c10::unshapedType for creating Generic List/Dict

Test Plan: Imported from OSS

Differential Revision: D18717189

Pulled By: IvanKobzarev

fbshipit-source-id: 1e352a9c776a7f7e69fd5b9ece558f1d1849ea57
2019-11-26 17:38:36 -08:00
6bd8937aee FunctionParameter::set_default_str replace || with &&
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30471

Test Plan: Imported from OSS

Differential Revision: D18710958

Pulled By: pbelevich

fbshipit-source-id: 7e5339175c7e16cd975a90bf6b123df728045e4d
2019-11-26 17:38:31 -08:00
21d7532dfe Add more comment on NumPy detection in Python scripts.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30417

Differential Revision: D18716502

Pulled By: albanD

fbshipit-source-id: 0b1b86f882e0e24cb6845e4a44708048e7e3b4a8
2019-11-26 17:38:27 -08:00
8bbafa0b32 Add logical_and and logical_or (#28162)
Summary:
Superseding https://github.com/pytorch/pytorch/issues/24379 as type promotion has been implemented.

Close https://github.com/pytorch/pytorch/issues/24379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28162
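
For reference, basic usage (the ops accept bool tensors, and with type promotion non-bool inputs work too):

```python
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, False])
torch.logical_and(a, b)  # tensor([True, False, False])
torch.logical_or(a, b)   # tensor([True, True, True])
```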

Differential Revision: D18580867

Pulled By: ailzhang

fbshipit-source-id: 7e4d7c37da4dc8df87314bd4f1f6a7539e46586a
2019-11-26 17:38:22 -08:00
92e27c5e89 Flag to disable Variable
Summary:
using `buck build mode/opt mode/no-gpu //experimental/ngimel/benchmark_framework_overheads:cpp_benchmark`

```
devvm497.prn3.facebook.com:/data/users/bwasti/fbsource/fbcode $ ./cpp_benchmark --niter 10000
creating inputs, number of dimensions 1
starting op
benchmarking 10000 iterations
using cpp frontend
elapsed time per iteration 0.90638 us
```

```
devvm497.prn3.facebook.com:/data/users/bwasti/fbsource/fbcode $ ./cpp_benchmark --niter 10000 --disable_variable_dispatch
creating inputs, number of dimensions 1
starting op
benchmarking 10000 iterations
using cpp frontend
elapsed time per iteration 0.775436 us
```

Test Plan: let all tests run

Reviewed By: smessmer

Differential Revision: D18654276

fbshipit-source-id: 362812b2c87ec428448b2ac65baac45f492fdce4
2019-11-26 17:38:18 -08:00
4eff2f2007 Fix missing closing quotes in docs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30448

Differential Revision: D18711396

Pulled By: zou3519

fbshipit-source-id: 6e35e0779716185791273eedca7a93667a6cda90
2019-11-26 17:38:13 -08:00
05a1644ce3 Fix BC for quantized linear
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30481

Test Plan: Imported from OSS

Differential Revision: D18714602

Pulled By: jamesr66a

fbshipit-source-id: d51206c22cf2446e98053446789c6324c0481321
2019-11-26 17:38:09 -08:00
976d91d30a Comment on a set of ops bound at the python layer
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30420

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D18713999

Pulled By: eellison

fbshipit-source-id: 3a8d6e4431cbfe6a78ca047217c1c53c47403841
2019-11-26 17:38:04 -08:00
634f370c63 Add comment to ops bound at python layer
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30419

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D18714000

Pulled By: eellison

fbshipit-source-id: 22ccb941b2db24031921f378c600e68fe70e1346
2019-11-26 17:37:59 -08:00
c5a6c4d6c9 Adding elementwise kernel also operating on index (#28175)
Summary:
This PR adds `gpu_kernel_with_index` as an addition to the element-wise kernel templates. It allows a kernel to operate not only on input tensor values but also on each value's index (viewed as 1-D, so from 0 to numel) within the lambda.
The direct use case here is to replace thrust::tabulate as used in range/arange/linspace. Benefits are:
- thrust::tabulate causes additional unnecessary synchronization on the CPU.
- It now works with the tensor iterator, so the output no longer needs to be contiguous and a memcpy is saved.

It can also potentially be reused to add new functions to PyTorch later, if we see use cases where both value and index are needed (for example, unifying tril/triu into tensor-iterator element-wise kernels, or other patterns).

Known issues:
https://github.com/pytorch/pytorch/pull/23586 is needed to make the non-contiguous case work properly, since overlapping needs to be checked. Currently a non-contiguous tensor falls into TOO_HARD. I could write a proper check in this file, but I figured using the existing method is better. jjsjann123
It does not work beyond 32-bit indexing, but thrust was erroring on those cases too. We could split the tensor in the caller to enable this. The index changes after a split, so it is easier for the caller to pass different lambdas and harder for the template to handle it in general.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28175
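
For reference, the semantics in plain Python (the actual kernel evaluates the lambda on the GPU; this loop is only illustrative):

```python
import torch

def tabulate(out, f):
    # out[i] = f(i) over the tensor viewed as 1-D, for i in [0, numel).
    flat = out.view(-1)
    for i in range(flat.numel()):
        flat[i] = f(i)
    return out

t = tabulate(torch.empty(5), lambda i: 2.0 * i)
assert torch.equal(t, torch.arange(0., 10., 2.))
```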

Differential Revision: D18708649

Pulled By: ngimel

fbshipit-source-id: 382081c96f266ae7b61095fc1f2af41c6b210fa9
2019-11-26 17:37:55 -08:00
e9cc4a5942 Add @DoNotStrip to nativeNewTensor method. (#30472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30472

Add DoNotStrip to nativeNewTensor method.
ghstack-source-id: 94596624

Test Plan:
Triggered build on diff for automation_fbandroid_fallback_release.

buck install -r fb4a

Tested BI cloaking using pytext lite interpreter.

Observe that logs are sent to the Scuba table:

{F223408345}

Reviewed By: linbinyu

Differential Revision: D18709087

fbshipit-source-id: 74fa7a0665640c294811a50913a60ef8d6b9b672
2019-11-26 12:16:33 -08:00
fec903ce00 Fix test case after get_qparams refactor (#30470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30470

att

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D18710775

fbshipit-source-id: b1c7c0afbc538ff1d3e19c5d3d6bd425e4f94f06
2019-11-26 12:16:29 -08:00
b0871f211b Make all optimizers consistent so that they don't change gradients inplace
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30257

Test Plan: Imported from OSS

Differential Revision: D18665461

Pulled By: albanD

fbshipit-source-id: cfdafef919468a41007881b82fd288b7128baf95
2019-11-26 12:16:25 -08:00
45880f4246 Change logging to remove the word "error" from info log
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30468

Reviewed By: xianjiec

Differential Revision: D18702959

fbshipit-source-id: a777445bea735dce89182dd95f38907963fab556
2019-11-26 12:16:21 -08:00
dcd9f49809 Specify ordering on singular values and eigenvalues output from torch… (#30389)
Summary:
….svd/symeig respectively

Changelog:
- Adds a note to docstrings of the both functions specifying the ordering

Fixes https://github.com/pytorch/pytorch/issues/30301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30389

Differential Revision: D18707608

Pulled By: zou3519

fbshipit-source-id: b0f73631578f39a24fae9af4997c6491de8be9a8
2019-11-26 10:23:47 -08:00
dbce53fe32 Turn off scalar_check for _th_gather. (#29954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29954

The underlying op handles scalar_check correctly.

Test Plan: Imported from OSS

Differential Revision: D18548054

Pulled By: gchanan

fbshipit-source-id: a1b44afa80c2928b78abbfba8b8b5d3608ac0fd3
2019-11-26 10:23:42 -08:00
72ac45662b Turn off scalar_checks for torch.take. (#29953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29953

The underlying function handles it correctly.

Test Plan: Imported from OSS

Differential Revision: D18548055

Pulled By: gchanan

fbshipit-source-id: cc2d0ae37d9689423363d115c6a653cb64840528
2019-11-26 10:23:37 -08:00
79a830af56 Turn off scalar_check for Tensor.set_(Tensor) (#29952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29952

The underlying op handles the check correctly.

Test Plan: Imported from OSS

Differential Revision: D18548048

Pulled By: gchanan

fbshipit-source-id: 9ac6fde743408e59ccdfc61bd574ebe6e2862238
2019-11-26 10:23:33 -08:00
0febff36ac Export dynamic unbind/split and __getitem__ (#29136)
Summary:
In ONNX opset 11, a series of sequence ops were added. Operators that are related to Tensor[] in PyTorch can be exported using these sequence ops.
In this PR, unbind/split that produce Tensor[], and __getitem__ that takes Tensor[] as input, are exported correctly to ONNX opset 11.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29136
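
A minimal sketch of the kind of model these changes target (opset 11 is required for the sequence ops):

```python
import io
import torch

class TakeSecond(torch.nn.Module):
    def forward(self, x):
        # unbind produces a Tensor[]; __getitem__ consumes it.
        return x.unbind(0)[1]

torch.onnx.export(TakeSecond(), torch.randn(3, 4), io.BytesIO(),
                  opset_version=11)
```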

Reviewed By: hl475

Differential Revision: D18309222

Pulled By: houseroad

fbshipit-source-id: be12c96bf8d0a56900683ef579f1c808c0a1af21
2019-11-26 06:54:06 -08:00
2599b9b551 Add output_size argument to caffe2 Int8ResizeNearest (#30202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30202

The PyTorch Upsample operator has output_size as an argument.
For quantized tensor inputs we cannot get the input_size to calculate the width and height scale factors.
Instead we pass the output_size directly to caffe2 to calculate the scale factors.

Test Plan:
python test/onnx/test_pytorch_onnx_caffe2_quantized.py TestQuantizedOps.test_upsample

Imported from OSS

Differential Revision: D18631478

fbshipit-source-id: 38a39129bc863f4ecf2293acc068e40ab7edc825
2019-11-26 06:54:02 -08:00
efe1859ad9 By default ignore RRef leaks during shutdown (#30217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30217

Before this commit, RRefContext throws an error if it detects any
RRef leak during shutdown. However, this requires applications to
make sure they have freed all references to RRefs in application
code, which can make for a bad debugging experience in large
applications. Besides, it also relies on Python GC to free things
up in time, which is not always the case. After this commit,
RRefContext ignores leaking RRefs during shutdown, as shutdown
is called when the application has finished training and no longer
cares about local state. Hence, it should be OK to just ignore
those leaks and destroy the OwnerRRefs. If an application would like
to enforce no leaks, set torch.distributed.rpc.api._ignore_rref_leak
to False.
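
For reference, re-enabling the strict behavior uses the private flag named above:

```python
import torch.distributed.rpc.api as rpc_api

# Fail on leaked RRefs at shutdown instead of ignoring them.
rpc_api._ignore_rref_leak = False
```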

Test Plan: Imported from OSS

Differential Revision: D18632546

Pulled By: mrshenli

fbshipit-source-id: 2744b2401dafdd16de0e0a76cf8e07777bed0f38
2019-11-26 06:53:58 -08:00
06db5ad707 Provide names for operator nodes in ONNX exported graph. (#27342)
Summary:
The PyTorch exporter does not add any name to the ONNX operators in the exported graph. A common request is to add names to op nodes by default. This helps the readability of the graph in visualization tools such as Netron, or when the ONNX graph is printed as a string. It also helps with the debuggability of the ONNX graph.

Therefore this PR adds name to operators in the exporters. The names follow a simple format, <op_type>_<index>. Expect files for tests in `test/onnx/test_operators.py` have been updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27342
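
A quick way to inspect the generated names (assumes the `onnx` package is installed; the exact names depend on the graph):

```python
import io
import onnx
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU())
buf = io.BytesIO()
torch.onnx.export(model, torch.randn(1, 4), buf)
graph = onnx.load_model_from_string(buf.getvalue()).graph
print([node.name for node in graph.node])  # e.g. ['Gemm_0', 'Relu_1']
```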

Reviewed By: hl475

Differential Revision: D17790979

Pulled By: houseroad

fbshipit-source-id: 1eaae88b5f51f152735a2ff96e22827837e34d9d
2019-11-26 06:53:53 -08:00
584be86c3f Try exporting ONNX with force_outplace=False (#29466)
Summary:
This should resolve https://github.com/pytorch/pytorch/issues/29008. This flag has two effects on the tracer.
- Remove the trailing underscore from in-place operators, e.g. index_put_ ==> index_put. This is handled separately in utils.py as well.
- Add out as an input for backward computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29466

Reviewed By: hl475

Differential Revision: D18422815

Pulled By: houseroad

fbshipit-source-id: 317b6a3c8a5751fe6fe49d7543e429d281ed0d6d
2019-11-26 06:53:49 -08:00
eccf42fd15 Bug fix: Handle missing keys in observer state dict during load (#30357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30357

Fix issue https://github.com/pytorch/pytorch/issues/29032 in loading from state dict for observers and fake quant.
ghstack-source-id: 94468814

Test Plan: Ensures that load/save of fake quant and observers with missing keys works correctly.

Differential Revision: D18668517

fbshipit-source-id: 0eda6f47c39102e55977fc548b9a03664f123ad7
2019-11-26 06:53:45 -08:00
ab5774547a Add info about transitive dependencies in case of using local aars (#30128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30128

Preview: https://github.com/pytorch/pytorch/tree/gh/IvanKobzarev/23/head/android

Based on users issue: https://discuss.pytorch.org/t/android-somethings-went-wrong-with-pytorch-android-1-4-0-snapshot/61009/3

Test Plan: Imported from OSS

Differential Revision: D18702658

Pulled By: IvanKobzarev

fbshipit-source-id: 14928baccd58ddbe633fad03038271d8333c4b49
2019-11-26 06:53:40 -08:00
085dde5965 Fix for when PyTorch model trace has RecursiveScriptModules (#30430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30430

When a module isn't a TracedModule, attempt to get name information from the `original_name` property on the module, and default to 'Module' when no such property exists.

Test Plan:
### Change child module to scripted module:
```
model = torchvision.models.alexnet()
model.classifier = torch.jit.script(model.classifier)
```
### Add graph
```
w = SummaryWriter()
w.add_graph(model, torch.rand((2, 3, 224, 224)))
w.close()
```
### No errors
However, the graph is disconnected in parts and hard to understand.
{F223327878}

Reviewed By: sanekmelnikov

Differential Revision: D18690836

fbshipit-source-id: 42295d06b7c1d48d5401776dca1e0d12cd64b49d
2019-11-26 06:53:35 -08:00
8199596d7e Add missing std::move (#30411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30411

-
ghstack-source-id: 94526555

Test Plan: unit tests

Differential Revision: D18690385

fbshipit-source-id: fd348c0887c279694c2f6d287b361c8e07f02ffb
2019-11-26 06:53:31 -08:00
661a6c8ef2 Add get_qparams and revert the changes to calculate_qparams (#30262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30262

`get_qparams` returns all the parameters needed to call the quantize function

Test Plan:
python test/test_jit.py

Imported from OSS

Differential Revision: D18645047

fbshipit-source-id: e57c11a66dac2d589778d412a996796ad5b6f86a
2019-11-26 06:53:26 -08:00
46e7f31fa3 Document unsupported types (#30344)
Summary:
This adds a listing of the parts of the `typing` module that are unsupported

This is also a first pass at deciding which features are 'unlikely to be implemented' vs 'not implemented', so those calls are open to discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30344

Pulled By: driazati

Differential Revision: D18665628

fbshipit-source-id: 22b8ebbde23df03839306cdb4344ca18a44f2c29
2019-11-26 06:53:22 -08:00
ab2ec4d835 Fix inexistent parameter in document (#24335)
Summary:
There is no `out` argument to `argsort` according to the source code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24335

Differential Revision: D16829134

Pulled By: vincentqb

fbshipit-source-id: 8f91154984cd4a753ba1d6105fb8a9bfa0da22b3
2019-11-26 06:53:17 -08:00
0b71e7e1fd Refactor QAT Conv module for better extensibility (#30362)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30362

Right now the QAT modules (qat.ConvBn2d, qat.ConvBnReLU2d, qat.Conv2d)
are not convenient to extend to other dimensions of Conv; this PR refactors
these modules so that we can support Conv1d/Conv3d better

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D18691152

fbshipit-source-id: 5b561e6b054eadd31b98cabdf1ac67a61ee9b805
2019-11-26 06:53:12 -08:00
b8f50d9cc8 Support to add dequant for each use of Value (#30145)
Summary:
In this PR, we mainly handle the case where a Value has multiple uses when inserting the quant-dequant pair. This change adds one dequant for each use of the Value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30145

Differential Revision: D18671600

Pulled By: lly-zero-one

fbshipit-source-id: 61324a98861da85b80dcf7e930381311118ae53b
2019-11-25 14:52:58 -08:00
25f4ba7c1b Improve compare kernel (#29743)
Summary:
Currently, the way the compare kernels handle dtypes is very funny (this behavior was introduced in https://github.com/pytorch/pytorch/pull/28427 and I only realized it today):

Let's say `a, b` are two float tensors on CUDA.

If you do `a < b`, this is what would happen inside the loop:
- Step 1: Fetch `a` and `b`, dynamically cast them from `float` to `float` (i.e. check the scalar type to figure out whether a cast is needed; it isn't, so do nothing)
- Step 2: compute `a < b`, get a `bool` result
- Step 3: statically cast the result into `float`
- Step 4: dynamically cast the result from `float` to `bool` and store the value

And if you do `a.lt_(b)`, this is what would happen:
- Step 1: Fetch `a` and `b`, no casting
- Step 2: compute `a < b`, get a `bool` result
- Step 3: statically cast the result into `float`
- Step 4: store the result to memory, no casting

Although dynamic casting happens in registers, it still hurts performance a bit (~8%).

This PR fixes this issue. Now for compare kernels, if the output is bool and the inputs have the same dtype, then there is no dynamic casting. Otherwise, there will be dynamic casting for each input and output. That is, the dynamic casting behavior of the two cases described above is swapped.

Benchmark of `a < b` on a tensor of 1000000000 fp32 elements:
- Before https://github.com/pytorch/pytorch/issues/28427: 6.35 ms
- Current master: 6.88 ms
- With this PR: 6.36 ms

Benchmarking `a.lt_(b)` does not show any difference across versions.

Besides this, what worries me most is that, with type promotion, the logic for the tensor iterator is becoming super complicated, and it is hard to see whether one change causes a performance regression elsewhere. I suggest we create scripts that benchmark the tensor iterator end to end, review that code, and put it somewhere inside the repository (maybe under `/tools` or `/test/scripts`?), so that whenever we are not certain about the performance we can run them to check. (Not on this PR, I guess, but on PRs after the script is done. If there are worries about performance, the author of a PR should run the script manually, and the reviewer should remind the author to do so if necessary.) If this is a good idea, I will send a PR for the script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29743
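
A rough timing sketch in the spirit of the numbers above (a smaller tensor here so it fits in memory; numbers will vary by device):

```python
import time
import torch

a = torch.rand(10**8, device="cuda")
b = torch.rand(10**8, device="cuda")

for _ in range(3):  # warm-up
    a < b
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    a < b
torch.cuda.synchronize()
print((time.time() - start) / 100 * 1e3, "ms per iteration")
```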

Differential Revision: D18671269

Pulled By: ngimel

fbshipit-source-id: 89a9c1c8b5fd45d5ae8fe907d65c2fe1a7dfd2dc
2019-11-25 14:52:53 -08:00
5c6705e62c add default arg for init_method (#30208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208

Adds a default arg for init_method so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from rpc.init_rpc. Also fixes some docs.
ghstack-source-id: 94500475

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18630074

fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
2019-11-25 14:52:48 -08:00
d64e2581cc Add list of supported XCode/CUDA versions to README
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30407

Differential Revision: D18689043

Pulled By: smessmer

fbshipit-source-id: cd772451ef31356ed3045ebb1a9c4f5e5e91bb45
2019-11-25 14:52:42 -08:00
0517323dad Update osx CI to XCode 9.4 / CUDA 10.0, cudnn 7.6.5 (#30359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30359

We need this for C++14 support

ghstack-source-id: 94519850

Test Plan: unit tests

Differential Revision: D18668868

fbshipit-source-id: 87e8eadf0e60a1699fba4524aea53b306b9a7f24
2019-11-25 14:52:37 -08:00
c12f9a12a8 Fix quantized ConvReLU3d test (#30266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30266

Fix quantized ConvReLU3d test

Test Plan: buck test mode/dev-nosan //caffe2/test:quantized -- "conv"

Reviewed By: hl475

Differential Revision: D18645717

fbshipit-source-id: bbe93f9daf5046f2aa05363efc7d0e59eaff37bf
2019-11-25 14:52:32 -08:00
d7ac90e2ef Stop binding std_single and var_single from TH; they aren't used anymore.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29951

Test Plan: Imported from OSS

Differential Revision: D18548057

Pulled By: gchanan

fbshipit-source-id: 0143f694517fa8229e53bd2bc636501804a3f80b
2019-11-25 14:52:27 -08:00
0c67311878 Turn off scalar_check for set_(Storage, ...) (#29950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29950

The underlying code handles it correctly.

Test Plan: Imported from OSS

Differential Revision: D18548052

Pulled By: gchanan

fbshipit-source-id: 88b737572c816fb0026ac5e66da7e3f4ab686773
2019-11-25 14:52:22 -08:00
7160300638 Turn off scalar_check for reductions _th_max, _th_min. (#29949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29949

The underlying functions handle this already.

Test Plan: Imported from OSS

Differential Revision: D18548047

Pulled By: gchanan

fbshipit-source-id: 123c9297db4e4315da9b1d996ac8b41aa1b4c7bc
2019-11-25 14:52:17 -08:00
16606e1725 Turn off scalar_check for mode; the underlying code is correct.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29948

Test Plan: Imported from OSS

Differential Revision: D18548053

Pulled By: gchanan

fbshipit-source-id: 15cdfc24d3e5123497c72dc09c5e6b28cb5e1f88
2019-11-25 14:52:12 -08:00
b8eba7aca9 Turn off scalar_check for ormqr. (#29947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29947

It requires > 0-dimensional tensors.

Test Plan: Imported from OSS

Differential Revision: D18548049

Pulled By: gchanan

fbshipit-source-id: ce80a42515b59513a0e5ef2b32e2c2b90b4d64f5
2019-11-25 14:52:07 -08:00
7c6cc1d6d4 Turn off scalar_checks for _th_multinomial_alias_draw. (#29946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29946

it requires > 0-dimensional tensors.

Test Plan: Imported from OSS

Differential Revision: D18548050

Pulled By: gchanan

fbshipit-source-id: 4d1e3b53bd701137cc2cb674f95627a5e064a274
2019-11-25 14:52:02 -08:00
6e88ddf352 Turn off scalar_check for _th_addmv and _th_eig as they can never pass. (#29945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29945

Both functions require at least 1 2-dimensional tensor, so can never return an inferred scalar.

Test Plan: Imported from OSS

Differential Revision: D18548056

Pulled By: gchanan

fbshipit-source-id: f99a41d490b9a5ab5717534c92e4f2e848c743e8
2019-11-25 14:51:56 -08:00
ce5f1a1b25 Turn off scalar_check for masked_select. (#29923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29923

Note that this changes the behavior of masked_select when both "self" and "mask" are 0-dimensional.

In previous versions of PyTorch, this would return a 0-dimensional tensor.  But the documentation reads:
"Returns a new 1-D tensor which indexes the input tensor according to the boolean mask mask which is a BoolTensor."

Test Plan: Imported from OSS

Differential Revision: D18539560

Pulled By: gchanan

fbshipit-source-id: 1637ed2c434fcf8ceead0073aa610581f4a19d21
2019-11-25 14:51:51 -08:00
0c9c62ba6e Turn off scalar_checks for __and__ and clone.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29880

Test Plan: Imported from OSS

Differential Revision: D18521732

Pulled By: gchanan

fbshipit-source-id: 7fdf5d8a7b93b43ac32067222cb8df5e790900de
2019-11-25 14:51:46 -08:00
94ad7544ae Turn off scalar_check for __or__
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29879

Test Plan: Imported from OSS

Differential Revision: D18521745

Pulled By: gchanan

fbshipit-source-id: 93d17d5e9cad5dd6d2c20221d87408c838d74eca
2019-11-25 14:51:40 -08:00
f994377d28 Turn off scalar_check for lshift, rshift.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29878

Test Plan: Imported from OSS

Differential Revision: D18521746

Pulled By: gchanan

fbshipit-source-id: 11fd7db79ac8ae76b1a5df25fb0ff59d81fcf394
2019-11-25 14:51:34 -08:00
99a46b44ea Use correct API macro in VariableHooksInterface. (#30320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30320

Fixes #30296

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18665704

Pulled By: ezyang

fbshipit-source-id: f09a953137fcc105959382254f9b8886af5aea3b
2019-11-25 14:51:29 -08:00
20dfae4099 Fix the crashes for c++ not able to find java class through Jni (#30390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30390

Fix the crashes for c++ not able to find java class through Jni
ghstack-source-id: 94499644

Test Plan: buck install -r fb4a

Reviewed By: ljk53

Differential Revision: D18667992

fbshipit-source-id: aa1b19c6dae39d46440f4a3e691054f7f8b1d42e
2019-11-25 14:51:23 -08:00
3990e9d1ca Improve performance of LeftRight::read() (#30282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30282

The atomic increment/decrements in LeftRight::read() were measurable in perf benchmarks. Let's improve their perf.
ghstack-source-id: 94443230

Test Plan: unit tests, perf benchmarks

Differential Revision: D18650228

fbshipit-source-id: d184ce8288510ab178e7c7da73562609d1ca3c9f
2019-11-23 15:25:13 -08:00
0c7e4c1d62 backend fallback test (#29682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29682

This PR re-introduces backend_fallback_test.cpp, which was previously called boxed_fallback_test.cpp and showed how to use the backend fallback API.
ghstack-source-id: 94481314

Test Plan: unit tests

Differential Revision: D18462654

fbshipit-source-id: 3e9b5c8f35c05f9cd795f44a5fefd1a0aaf03509
2019-11-23 15:25:09 -08:00
959a849a23 better boxing (#29681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29681

Remove callUnboxedOnly() and instead use metaprogramming to figure out if an operator can use a boxed fallback or not.
This enables boxed fallback for ops in native_functions.yaml even if they don't have `use_c10_dispatcher: full` set, as long as they're in the range of supported types.
ghstack-source-id: 94481320

Test Plan: unit tests

Differential Revision: D18462653

fbshipit-source-id: 2955e3c4949267520a1734a6a2b919ef5e9684a2
2019-11-23 15:25:05 -08:00
aa2862b843 Hide the OperatorKernel* argument from the stack based kernel API (#29337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29337

This argument is needed by boxing wrappers so they're able to get a pointer to the corresponding unboxed kernel and call into it.
But if a kernel is registered in a boxed way, we don't need it and should hide this from the API.
This is especially needed for the backend fallback API where users would only be left wondering why this argument is there and what it does.
Also, hiding it allows us to potentially totally remove it in a future refactoring if we find some way to do so.
ghstack-source-id: 94481316

Test Plan: unit tests

Differential Revision: D18361991

fbshipit-source-id: 5cef26c896fe3f2a5db730d3bc79dcd62e7ef492
2019-11-23 15:25:01 -08:00
afdc0bd4ec OperatorHandle::callBoxed/callUnboxed (#29330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29330

This makes for a nicer API, especially in backend fallback kernels who get an OperatorHandle instance and can directly call these methods on it.
ghstack-source-id: 94481322

Test Plan: unit tests stacked on top

Differential Revision: D18357424

fbshipit-source-id: fa8c638335f246c906c8e16186507b4c486afb3f
2019-11-23 15:24:57 -08:00
fb8c17dde1 Test cases for backend fallback kernels (#29214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29214

-
ghstack-source-id: 94481312

Test Plan: unit tests

Differential Revision: D18329308

fbshipit-source-id: 1dbae401f2255c69ed16d436f891b9b60c333d81
2019-11-23 15:24:53 -08:00
583c288232 Add a OperatorHandle argument to boxed kernels (#29201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29201

This is required for boxed backend fallback kernels (e.g. lazy, AMP) because they need to know which op was actually called.
ghstack-source-id: 94481313

Test Plan: I will add unit tests in a diff stacked on top

Differential Revision: D18282746

fbshipit-source-id: 339a1bbabd6aff31a587b98f095c75104dfc6f99
2019-11-23 15:24:49 -08:00
24aabe439a Make Dispatcher::backendFallbackKernels_ an array (#30340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30340

We already made OperatorEntry::dispatchTable_ an array to be able to avoid the concurrency primitives there,
but Dispatcher::backendFallbackKernels_ has the same issue. Let's make it a table too.

Since there is some code duplication here, we also factor out the concept of a KernelFunctionTable to be used in both places.
ghstack-source-id: 94481317

Test Plan: unit tests

Differential Revision: D18663426

fbshipit-source-id: ba82ca5c4cae581eea359d5c0c3a5e23b0f8838c
2019-11-23 15:24:45 -08:00
7b5045be9d Remove LeftRight from OperatorEntry and DispatchTable. (#30333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30333

re-export of https://github.com/pytorch/pytorch/pull/30328
ghstack-source-id: 94481321

Differential Revision: D18661518

fbshipit-source-id: 5a35a1ed2fae3b21a43614957a91d648c21bcca1
2019-11-23 15:24:41 -08:00
4aa692fc91 Convert KernelTable to a flat-indexed array rather than a hashtable. (#30332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30332

-
ghstack-source-id: 94481315

Reviewed By: resistor

Differential Revision: D18660421

fbshipit-source-id: 9f11434f1c3c234c45f586719182053fa81731f0
2019-11-23 15:24:37 -08:00
7c4b9042ab Updates to quantization documentation (#30288)
Summary:
This pull request includes fixes for six quantization doc bugs.

https://github.com/pytorch/pytorch/issues/30283 - Rendering issue on QConfig
https://github.com/pytorch/pytorch/issues/26305 - Minor doc issue on fuse_modules()
https://github.com/pytorch/pytorch/issues/27451 - Issues with ConvReLU2d, ConvReLU3d, and LinearReLU doc issues
https://github.com/pytorch/pytorch/issues/26899 - Missing docstrings in torch.nn.intrinsic fused functions
https://github.com/pytorch/pytorch/issues/29735 - add discussion of QNNPack to quantization doc page
https://github.com/pytorch/pytorch/issues/27938 - some of the quantized functions lack documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30288

Differential Revision: D18653368

Pulled By: gottbrath

fbshipit-source-id: 410b3dd81ff10909a7f1a7736ca42d7cabf0beb1
2019-11-23 09:29:30 -08:00
7570b2798a updating citation (#30267)
Summary:
NIPS -> NeurIPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30267

Differential Revision: D18672928

Pulled By: soumith

fbshipit-source-id: c20f26a0547f94ff39f8ee40e5f0ccc5fcc814af
2019-11-23 07:24:14 -08:00
59ca9b7430 Graph-mode quantization for convolution from traced model (#30245)
Summary:
In this PR, we enhance graph-mode quantization for aten::_convolution, which can be generated from the tracing path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30245

Differential Revision: D18671597

Pulled By: lly-zero-one

fbshipit-source-id: 78a2470fbb0fe0def55d63c6bda7cbb5c89f7848
2019-11-23 01:24:50 -08:00
2a7a39c1af (de)serialization of values between C++ and Python (#30108)
Summary:
This PR updates `torch::pickle_save` to use the new zipfile format introduced in #29232 and adds `torch::pickle_load` which can decode the zipfile format. Now that `torch.save/load` use this format as well (if the `_use_new_zipfile_serialization` flag is `True`), raw values saved in Python can be loaded in C++ and vice versa.

Fixes #20356
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30108
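
On the Python side, the matching format is opt-in via the flag mentioned above:

```python
import torch

# Write the new zipfile format so torch::pickle_load can read it from C++.
torch.save({"weights": torch.ones(3)}, "value.pt",
           _use_new_zipfile_serialization=True)
obj = torch.load("value.pt")  # round-trips in Python as well
```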

Pulled By: driazati

Differential Revision: D18607087

fbshipit-source-id: 067cdd5b1cf9c30ddc7e2e5021a8cceee62d8a14
2019-11-23 00:06:07 -08:00
ee20e66c48 replace the SLSRQ for their right emulations in the replayer test (#30367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30367

use the SLS emulations that match the hardware

Test Plan: replayer test

Differential Revision: D18667605

fbshipit-source-id: 89aee630184737b86ecfb09717437e5c7473e42c
2019-11-23 00:06:03 -08:00
328ec5460f refactor the observer removal and quantize tensor
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30360

Differential Revision: D18670373

Pulled By: lly-zero-one

fbshipit-source-id: 1481d6e4d5ce40376577b8deb0a0f74d5559076e
2019-11-22 21:25:23 -08:00
6a00191fc2 Add RpcAgent::getWorkerInfos() (#30241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30241

We need an API to get all worker infos. This will be used by the backend-agnostic `rpc.wait_all_workers()` API.
ghstack-source-id: 94454935

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_get_worker_infos

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_get_worker_infos
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_get_worker_infos

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_get_worker_infos
```

Differential Revision: D5693412

fbshipit-source-id: 5123c8248b6d44fd36b8a5f381dbabb2660e6f0f
2019-11-22 18:26:30 -08:00
c7f988b8c6 transport open registration (#30167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/29164

- Created GlooDeviceFactory to hide device creation details
- Added a transport option on the Python interface

The reason for making the factory class is to make it easier to extend the Gloo transport in the future

Test Plan: Imported from OSS

Reviewed By: satgera, d4l3k

Differential Revision: D18596527

fbshipit-source-id: e8114162ee8d841c0e0769315b48356b37d6ca0a
2019-11-22 17:41:52 -08:00
ac103a5d78 Remove variable wrapping from register_c10_ops (#29207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29207

The logic calling c10 ops from JIT did some variable wrapping to make sure all results are always variables.
Thanks to ezyang, this is not needed anymore because everything is a variable now.
ghstack-source-id: 93345590

Test Plan: waitforsandcastle

Differential Revision: D18327507

fbshipit-source-id: 86512c5e19d6972d70f125feae172461c25e3cb6
2019-11-22 15:32:55 -08:00
9fb879934e Revert D18641413: add unit tests to iOS CI jobs
Test Plan: revert-hammer

Differential Revision:
D18641413

Original commit changeset: 12942206f1de

fbshipit-source-id: 4fa76d50fb897db4342d10a4e46a9887e37ef233
2019-11-22 15:24:27 -08:00
6c9b188262 Support in-place update in IndexHashOp (#30275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30275

`IndexHash` did not support in-place update.

Reviewed By: kennyhorror

Differential Revision: D18612231

fbshipit-source-id: adeccdf1ceb6107454555ff9cdf66fd5e5773f2a
2019-11-22 14:49:28 -08:00
99a2a0b1ca Implement torch.diagonal for named tensors (#30193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30193

Featuring:
- Added a NoNamesGuard::reset() function that sets NamesMode back to
what it was before the guard. This makes it so that we don't have to
create a new context to run code in an unnamed way.
- Added a diagonal(Tensor, *, Dimname outdim, Dimname dim1, Dimname dim2, int64_t offset=0)
overload. All of the non-tensor arguments are keyword only for
readability purposes; something like `tensor.diagonal("A", "B", "C")`
would be really confusing.
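
Usage per the overload above (a short sketch; the dimension names are illustrative):

```python
import torch

t = torch.randn(3, 3, names=('A', 'B'))
d = t.diagonal(outdim='C', dim1='A', dim2='B')
print(d.names)  # ('C',)
```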

Test Plan: - Added new tests

Differential Revision: D18638363

Pulled By: zou3519

fbshipit-source-id: ea37b52a19535f84a69be38e95e569e88f307381
2019-11-22 14:49:23 -08:00
2e709763a3 add wrapper to exclude XLA when running device tests
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30316

Test Plan: Imported from OSS

Differential Revision: D18659286

Pulled By: nairbv

fbshipit-source-id: 86d035bb0c54c612868590c3188cfcd969c3f686
2019-11-22 13:04:59 -08:00
8c6f0c0587 Detect TorchScript archives in torch.load (#29339)
Summary:
This PR looks for a `constants.pkl` file at the top level in a zip file
in `torch.load`. If found, it calls `torch.jit.load` instead and issues
a warning to call `torch.jit.load` directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29339

Differential Revision: D18611095

Pulled By: driazati

fbshipit-source-id: f070a02f6b5509054fc3876b3e8356bbbcc183e1
2019-11-22 12:30:30 -08:00
90cb1e67ff Fix exception message in Java Tensor
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30205

Test Plan: Imported from OSS

Reviewed By: linbinyu

Differential Revision: D18653568

Pulled By: dreiss

fbshipit-source-id: a5fcb809eba641a7fbd0e99e835eceeb248e680c
2019-11-22 12:04:49 -08:00
0c18de2623 Add inferBoundShapeOp
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30101

Reviewed By: ipiszy

Differential Revision: D18387803

fbshipit-source-id: 5edb6b949257370b62fa6da477bd6ed2f16a9bd1
2019-11-22 12:04:45 -08:00
35e6c1763e Switch Docker image conda-cuda-cxx11-ubuntu1604 to new uniform name (#29943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29943

This was apparently the same as "pytorch/pytorch-binary-docker-image-ubuntu16.04:latest",
so standardize on that name.

Test Plan:
This PR, which is stacked on top of a commit that puts one of the jobs
using that container into the set of PR builds.

Imported from OSS

Differential Revision: D18653554

fbshipit-source-id: 40e6c52db02265d61e8166bb1211376faccfc53a
2019-11-22 11:39:55 -08:00
4129 changed files with 366994 additions and 126125 deletions

.bazelrc Normal file

@ -0,0 +1,3 @@
build --copt=--std=c++14
build --copt=-I.
build --copt=-isystem --copt bazel-out/k8-fastbuild/bin

.bazelversion Normal file

@ -0,0 +1 @@
3.1.0


@ -71,9 +71,9 @@ A **binary configuration** is a collection of
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 2.7m, 2.7mu, 3.5m, 3.6m 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists)
* macos: 2.7, 3.5, 3.6, 3.7
* windows: 3.5, 3.6, 3.7
* linux: 3.5m, 3.6m 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists)
* macos: 3.6, 3.7, 3.8
* windows: 3.6, 3.7, 3.8
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
@ -466,7 +466,7 @@ But if you want to try, then I'd recommend
# Always install miniconda 3, even if building for Python <3
new_conda="~/my_new_conda"
conda_sh="$new_conda/install_miniconda.sh"
curl -o "$conda_sh" https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x "$conda_sh"
"$conda_sh" -b -p "$MINICONDA_ROOT"
rm -f "$conda_sh"


@ -5,9 +5,6 @@ for "smoketest" builds.
Each subclass of ConfigNode represents a layer of the configuration hierarchy.
These tree nodes encapsulate the logic for whether a branch of the hierarchy
should be "pruned".
In addition to generating config.yml content, the tree is also traversed
to produce a visualization of config dimensions.
"""
from collections import OrderedDict
@ -34,15 +31,13 @@ def get_processor_arch_name(cuda_version):
LINUX_PACKAGE_VARIANTS = OrderedDict(
manywheel=[
"2.7m",
"2.7mu",
"3.5m",
"3.6m",
"3.7m",
"3.8m",
],
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"2.7m",
"3.7m",
],
)
@ -52,7 +47,14 @@ CONFIG_TREE_DATA = OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"2.7",
"3.7",
],
)),
windows=(dimensions.CUDA_VERSIONS, OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"3.7",
],
)),
)
@ -73,6 +75,11 @@ LINUX_GCC_CONFIG_VARIANTS = OrderedDict(
],
)
WINDOWS_LIBTORCH_CONFIG_VARIANTS = [
"debug",
"release",
]
class TopLevelNode(ConfigNode):
def __init__(self, node_name, config_tree_data, smoke):
@ -107,6 +114,8 @@ class PackageFormatConfigNode(ConfigNode):
def get_children(self):
if self.find_prop("os_name") == "linux":
return [LinuxGccConfigNode(self, v) for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]]
elif self.find_prop("os_name") == "windows" and self.find_prop("package_format") == "libtorch":
return [WindowsLibtorchConfigNode(self, v) for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
@ -128,6 +137,16 @@ class LinuxGccConfigNode(ConfigNode):
return [ArchConfigNode(self, v) for v in cuda_versions]
class WindowsLibtorchConfigNode(ConfigNode):
def __init__(self, parent, libtorch_config_variant):
super(WindowsLibtorchConfigNode, self).__init__(parent, "LIBTORCH_CONFIG_VARIANT=" + str(libtorch_config_variant))
self.props["libtorch_config_variant"] = libtorch_config_variant
def get_children(self):
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
class ArchConfigNode(ConfigNode):
def __init__(self, parent, cu):
super(ArchConfigNode, self).__init__(parent, get_processor_arch_name(cu))
@ -145,8 +164,6 @@ class PyVersionConfigNode(ConfigNode):
self.props["pyver"] = pyver
def get_children(self):
smoke = self.find_prop("smoke")
package_format = self.find_prop("package_format")
os_name = self.find_prop("os_name")


@ -1,12 +1,12 @@
from collections import OrderedDict
import cimodel.data.simple.util.branch_filters as branch_filters
import cimodel.data.binary_build_data as binary_build_data
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
class Conf(object):
def __init__(self, os, cuda_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant):
def __init__(self, os, cuda_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant, libtorch_config_variant):
self.os = os
self.cuda_version = cuda_version
@ -15,16 +15,19 @@ class Conf(object):
self.smoke = smoke
self.libtorch_variant = libtorch_variant
self.gcc_config_variant = gcc_config_variant
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = [self.pydistro] + self.parms + [binary_build_data.get_processor_arch_name(self.cuda_version)]
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
elems.append(str(self.libtorch_config_variant))
return elems
def gen_docker_image(self):
if self.gcc_config_variant == 'gcc5.4_cxx11-abi':
return miniutils.quote("pytorch/conda-cuda-cxx11-ubuntu1604:latest")
return miniutils.quote("pytorch/pytorch-binary-docker-image-ubuntu16.04:latest")
docker_word_substitution = {
"manywheel": "manylinux",
@ -33,11 +36,9 @@ class Conf(object):
docker_distro_prefix = miniutils.override(self.pydistro, docker_word_substitution)
# The cpu nightlies are built on the pytorch/manylinux-cuda100 docker image
alt_docker_suffix = self.cuda_version or "100"
# The cpu nightlies are built on the pytorch/manylinux-cuda102 docker image
alt_docker_suffix = self.cuda_version or "102"
docker_distro_suffix = "" if self.pydistro == "conda" else alt_docker_suffix
if self.cuda_version == "101":
return "soumith/manylinux-cuda101@sha256:5d62be90d5b7777121180e6137c7eed73d37aaf9f669c51b783611e37e0b4916"
return miniutils.quote("pytorch/" + docker_distro_prefix + "-cuda" + docker_distro_suffix)
def get_name_prefix(self):
@ -63,22 +64,32 @@ class Conf(object):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase, nightly)
job_def["build_environment"] = miniutils.quote(" ".join(self.gen_build_env_parms()))
job_def["requires"] = ["setup"]
if self.smoke:
job_def["requires"].append("update_s3_htmls_for_nightlies")
job_def["requires"].append("update_s3_htmls_for_nightlies_devtoolset7")
job_def["filters"] = {"branches": {"only": "postnightly"}}
job_def["requires"] = [
"update_s3_htmls",
]
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
)
else:
job_def["filters"] = {"branches": {"only": "nightly"}}
if phase in ["upload"]:
filter_branch = "nightly"
else:
filter_branch = r"/.*/"
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=[filter_branch],
tags_list=[branch_filters.RC_PATTERN],
)
if self.libtorch_variant:
job_def["libtorch_variant"] = miniutils.quote(self.libtorch_variant)
if phase == "test":
if not self.smoke:
job_def["requires"].append(self.gen_build_name("build", nightly))
if not (self.smoke and self.os == "macos"):
job_def["requires"] = [self.gen_build_name("build", nightly)]
if not (self.smoke and self.os == "macos") and self.os != "windows":
job_def["docker_image"] = self.gen_docker_image()
if self.cuda_version:
if self.os != "windows" and self.cuda_version:
job_def["use_cuda_docker_runtime"] = miniutils.quote("1")
else:
if self.os == "linux" and phase != "upload":
@ -86,10 +97,15 @@ class Conf(object):
if phase == "test":
if self.cuda_version:
job_def["resource_class"] = "gpu.medium"
if self.os == "windows":
job_def["executor"] = "windows-with-nvidia-gpu"
else:
job_def["resource_class"] = "gpu.medium"
if phase == "upload":
job_def["context"] = "org-member"
job_def["requires"] = ["setup", self.gen_build_name(upload_phase_dependency, nightly)]
job_def["requires"] = [
self.gen_build_name(upload_phase_dependency, nightly)
]
os_name = miniutils.override(self.os, {"macos": "mac"})
job_name = "_".join([self.get_name_prefix(), os_name, phase])
@ -119,29 +135,54 @@ def gen_build_env_list(smoke):
c.find_prop("smoke"),
c.find_prop("libtorch_variant"),
c.find_prop("gcc_config_variant"),
c.find_prop("libtorch_config_variant"),
)
newlist.append(conf)
return newlist
def predicate_exclude_nonlinux_and_libtorch(config):
return config.os == "linux"
def predicate_exclude_macos(config):
return config.os == "linux" or config.os == "windows"
def get_nightly_uploads():
configs = gen_build_env_list(False)
mylist = []
for conf in configs:
phase_dependency = "test" if predicate_exclude_nonlinux_and_libtorch(conf) else "build"
phase_dependency = "test" if predicate_exclude_macos(conf) else "build"
mylist.append(conf.gen_workflow_job("upload", phase_dependency, nightly=True))
return mylist
def get_post_upload_jobs():
"""Generate jobs to update HTML indices and report binary sizes"""
configs = gen_build_env_list(False)
common_job_def = {
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"requires": [],
}
for conf in configs:
upload_job_name = conf.gen_build_name(
build_or_test="upload",
nightly=True
)
common_job_def["requires"].append(upload_job_name)
return [
{
"update_s3_htmls": {
"name": "update_s3_htmls",
**common_job_def,
},
},
]
def get_nightly_tests():
configs = gen_build_env_list(False)
filtered_configs = filter(predicate_exclude_nonlinux_and_libtorch, configs)
filtered_configs = filter(predicate_exclude_macos, configs)
tests = []
for conf_options in filtered_configs:


@ -4,8 +4,9 @@ from cimodel.lib.conf_tree import Ver
CONFIG_TREE_DATA = [
(Ver("ubuntu", "16.04"), [
([Ver("gcc", "5")], [XImportant("onnx_py2")]),
([Ver("clang", "7")], [XImportant("onnx_py3.6")]),
([Ver("clang", "7")], [XImportant("onnx_main_py3.6"),
XImportant("onnx_ort1_py3.6"),
XImportant("onnx_ort2_py3.6")]),
]),
]
@ -27,7 +28,9 @@ class TreeConfigNode(ConfigNode):
return [self.child_constructor()(self, k, v) for (k, v) in self.subtree]
def is_build_only(self):
if str(self.find_prop("language_version")) == "onnx_py3.6":
if str(self.find_prop("language_version")) == "onnx_main_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort1_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort2_py3.6":
return False
return set(str(c) for c in self.find_prop("compiler_version")).intersection({
"clang3.8",
@ -36,6 +39,12 @@ class TreeConfigNode(ConfigNode):
"android",
}) or self.find_prop("distro_version").name == "macos"
def is_test_only(self):
if str(self.find_prop("language_version")) == "onnx_ort1_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort2_py3.6":
return True
return False
class TopLevelNode(TreeConfigNode):
def __init__(self, node_name, subtree):
@ -68,6 +77,7 @@ class LanguageConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["language_version"] = node_name
self.props["build_only"] = self.is_build_only()
self.props["test_only"] = self.is_test_only()
def child_constructor(self):
return ImportantConfigNode


@ -5,14 +5,14 @@ import cimodel.lib.conf_tree as conf_tree
from cimodel.lib.conf_tree import Ver
import cimodel.lib.miniutils as miniutils
from cimodel.data.caffe2_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from dataclasses import dataclass
DOCKER_IMAGE_PATH_BASE = "308535385114.dkr.ecr.us-east-1.amazonaws.com/caffe2/"
DOCKER_IMAGE_VERSION = 345
DOCKER_IMAGE_VERSION = "376"
@dataclass
@ -23,6 +23,7 @@ class Conf:
# for gpu files and host compiler (gcc/clang) for cpu files)
compilers: [Ver]
build_only: bool
test_only: bool
is_important: bool
@property
@ -32,8 +33,9 @@ class Conf:
# TODO: Eventually we can probably just remove the cudnn7 everywhere.
def get_cudnn_insertion(self):
omit = self.language == "onnx_py2" \
or self.language == "onnx_py3.6" \
omit = self.language == "onnx_main_py3.6" \
or self.language == "onnx_ort1_py3.6" \
or self.language == "onnx_ort2_py3.6" \
or set(self.compiler_names).intersection({"android", "mkl", "clang"}) \
or str(self.distro) in ["ubuntu14.04", "macos10.13"]
@ -50,6 +52,13 @@ class Conf:
def construct_phase_name(self, phase):
root_parts = self.get_build_name_root_parts()
build_name_substitutions = {
"onnx_ort1_py3.6": "onnx_main_py3.6",
"onnx_ort2_py3.6": "onnx_main_py3.6",
}
if phase == "build":
root_parts = [miniutils.override(r, build_name_substitutions) for r in root_parts]
return "_".join(root_parts + [phase]).replace(".", "_")
def get_platform(self):
@ -61,9 +70,10 @@ class Conf:
def gen_docker_image(self):
lang_substitutions = {
"onnx_py2": "py2",
"onnx_py3.6": "py3.6",
"cmake": "py2",
"onnx_main_py3.6": "py3.6",
"onnx_ort1_py3.6": "py3.6",
"onnx_ort2_py3.6": "py3.6",
"cmake": "py3",
}
lang = miniutils.override(self.language, lang_substitutions)
@ -73,8 +83,10 @@ class Conf:
def gen_workflow_params(self, phase):
parameters = OrderedDict()
lang_substitutions = {
"onnx_py2": "onnx-py2",
"onnx_py3.6": "onnx-py3.6",
"onnx_py3": "onnx-py3",
"onnx_main_py3.6": "onnx-main-py3.6",
"onnx_ort1_py3.6": "onnx-ort1-py3.6",
"onnx_ort2_py3.6": "onnx-ort2-py3.6",
}
lang = miniutils.override(self.language, lang_substitutions)
@ -106,16 +118,15 @@ class Conf:
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.construct_phase_name(phase)
job_def["requires"] = ["setup"]
if phase == "test":
job_def["requires"].append(self.construct_phase_name("build"))
job_def["requires"] = [self.construct_phase_name("build")]
job_name = "caffe2_" + self.get_platform() + "_test"
else:
job_name = "caffe2_" + self.get_platform() + "_build"
if not self.is_important:
job_def["filters"] = {"branches": {"only": ["master", r"/ci-all\/.*/"]}}
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}
@ -136,6 +147,7 @@ def instantiate_configs():
distro=fc.find_prop("distro_version"),
compilers=fc.find_prop("compiler_version"),
build_only=fc.find_prop("build_only"),
test_only=fc.find_prop("test_only"),
is_important=fc.find_prop("important"),
)
@ -150,10 +162,11 @@ def get_workflow_jobs():
x = []
for conf_options in configs:
phases = ["build"]
if not conf_options.build_only:
phases = dimensions.PHASES
if conf_options.test_only:
phases = ["test"]
for phase in phases:
x.append(conf_options.gen_workflow_job(phase))
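Net effect of the test_only flag plus the build-name substitutions defined in construct_phase_name above: an ORT config emits only a test job, and that job's build dependency is rewritten onto the shared onnx_main build. A minimal sketch, assuming miniutils.override is a dict lookup with fallback (an assumption; the real helper lives in cimodel.lib.miniutils):

def override(word, substitutions):
    # assumed behavior of cimodel.lib.miniutils.override
    return substitutions.get(word, word)

build_name_substitutions = {
    "onnx_ort1_py3.6": "onnx_main_py3.6",
    "onnx_ort2_py3.6": "onnx_main_py3.6",
}
# The substitution is applied only when constructing the "build" phase name,
# so the test job keeps its own ORT name while depending on the main build:
print(override("onnx_ort1_py3.6", build_name_substitutions))  # onnx_main_py3.6
print(override("onnx_ort1_py3.6", {}))                        # onnx_ort1_py3.6 (unchanged)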


@ -3,13 +3,12 @@ PHASES = ["build", "test"]
CUDA_VERSIONS = [
None, # cpu build
"92",
"100",
"101",
"102",
]
STANDARD_PYTHON_VERSIONS = [
"2.7",
"3.5",
"3.6",
"3.7",
"3.8"
]


@ -4,17 +4,14 @@ from cimodel.lib.conf_tree import ConfigNode, X, XImportant
CONFIG_TREE_DATA = [
("xenial", [
(None, [
XImportant("2.7.9"),
X("2.7"),
XImportant("3.5"), # Not run on all PRs, but should be included on [test all]
X("nightly"),
]),
("gcc", [
("5.4", [ # All this subtree rebases to master and then build
XImportant("3.6"),
("3.6", [
("parallel_tbb", [XImportant(True)]),
("parallel_native", [XImportant(True)]),
("parallel_tbb", [X(True)]),
("parallel_native", [X(True)]),
]),
]),
# TODO: bring back libtorch test
@ -24,41 +21,44 @@ CONFIG_TREE_DATA = [
("5", [
XImportant("3.6"), # This is actually the ASAN build
]),
# ("7", [
# ("3.6", [
# ("xla", [XImportant(True)]),
# ]),
# ]),
]),
("cuda", [
("9", [
# Note there are magic strings here
# https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/build.sh#L21
# and
# https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/build.sh#L143
# and
# https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/build.sh#L153
# (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259453144)
("9.2", [
X("3.6"),
("3.6", [
("cuda_gcc_override", [X("gcc5.4")])
])
]),
("10.1", [X("3.6")]),
("10.2", [
XImportant("3.6"),
("3.6", [
("libtorch", [XImportant(True)])
]),
]),
("9.2", [X("3.6")]),
("10", [X("3.6")]),
("10.1", [X("3.6")]),
]),
("android", [
("r19c", [
("3.6", [
("android_abi", [XImportant("x86_32")]),
("android_abi", [X("x86_64")]),
("android_abi", [X("arm-v7a")]),
("android_abi", [X("arm-v8a")]),
])
("11.0", [
X("3.8"),
("3.8", [
("libtorch", [X(True)])
]),
]),
]),
]),
("bionic", [
("clang", [
("9", [
XImportant("3.6"),
]),
("9", [
("3.6", [
("xla", [XImportant(True)]),
]),
]),
]),
("gcc", [
("9", [XImportant("3.8")]),
]),
]),
]
@ -101,6 +101,7 @@ class DistroConfigNode(TreeConfigNode):
next_nodes = {
"xenial": XenialCompilerConfigNode,
"bionic": BionicCompilerConfigNode,
}
return next_nodes[distro]
@ -128,7 +129,8 @@ class ExperimentalFeatureConfigNode(TreeConfigNode):
"parallel_native": ParallelNativeConfigNode,
"libtorch": LibTorchConfigNode,
"important": ImportantConfigNode,
"android_abi": AndroidAbiConfigNode,
"build_only": BuildOnlyConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode
}
return next_nodes[experimental_feature]
@ -143,6 +145,7 @@ class XlaConfigNode(TreeConfigNode):
def child_constructor(self):
return ImportantConfigNode
class ParallelTBBConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELTBB=" + str(label)
@ -153,6 +156,7 @@ class ParallelTBBConfigNode(TreeConfigNode):
def child_constructor(self):
return ImportantConfigNode
class ParallelNativeConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELNATIVE=" + str(label)
@ -163,6 +167,7 @@ class ParallelNativeConfigNode(TreeConfigNode):
def child_constructor(self):
return ImportantConfigNode
class LibTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "BUILD_TEST_LIBTORCH=" + str(label)
@ -173,14 +178,23 @@ class LibTorchConfigNode(TreeConfigNode):
def child_constructor(self):
return ImportantConfigNode
class AndroidAbiConfigNode(TreeConfigNode):
class CudaGccOverrideConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["android_abi"] = node_name
self.props["cuda_gcc_override"] = node_name
def child_constructor(self):
return ImportantConfigNode
class BuildOnlyConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["build_only"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ImportantConfigNode(TreeConfigNode):
def modify_label(self, label):
return "IMPORTANT=" + str(label)
@ -206,6 +220,20 @@ class XenialCompilerConfigNode(TreeConfigNode):
return XenialCompilerVersionConfigNode if self.props["compiler_name"] else PyVerConfigNode
class BionicCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
def init2(self, node_name):
self.props["compiler_name"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return BionicCompilerVersionConfigNode if self.props["compiler_name"] else PyVerConfigNode
class XenialCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
@ -213,3 +241,12 @@ class XenialCompilerVersionConfigNode(TreeConfigNode):
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode
class BionicCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode


@ -4,18 +4,13 @@ from cimodel.data.pytorch_build_data import TopLevelNode, CONFIG_TREE_DATA
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from cimodel.data.simple.util.docker_constants import gen_docker_image_path
from dataclasses import dataclass, field
from typing import List, Optional
DOCKER_IMAGE_PATH_BASE = "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/"
# ARE YOU EDITING THIS NUMBER? MAKE SURE YOU READ THE GUIDANCE AT THE
# TOP OF .circleci/config.yml
DOCKER_IMAGE_VERSION = 405
@dataclass
class Conf:
distro: str
@ -27,6 +22,7 @@ class Conf:
# tensorrt, leveldb, lmdb, redis, opencv, mkldnn, ideep, etc.
# (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259453608)
is_xla: bool = False
vulkan: bool = False
restrict_phases: Optional[List[str]] = None
gpu_resource: Optional[str] = None
dependent_tests: List = field(default_factory=list)
@ -53,9 +49,10 @@ class Conf:
cuda_parms = []
if self.cuda_version:
cuda_parms.extend(["cuda" + self.cuda_version, "cudnn7"])
cudnn = "cudnn8" if self.cuda_version.startswith("11.") else "cudnn7"
cuda_parms.extend(["cuda" + self.cuda_version, cudnn])
result = leading + ["linux", self.distro] + cuda_parms + self.parms
if (not for_docker and self.parms_list_ignored_for_docker_image is not None):
if not for_docker and self.parms_list_ignored_for_docker_image is not None:
result = result + self.parms_list_ignored_for_docker_image
return result
@ -64,7 +61,7 @@ class Conf:
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
return miniutils.quote(DOCKER_IMAGE_PATH_BASE + base_build_env_name + ":" + str(DOCKER_IMAGE_VERSION))
return miniutils.quote(gen_docker_image_path(base_build_env_name))
def get_build_job_name_pieces(self, build_or_test):
return self.get_parms(False) + [build_or_test]
@ -92,10 +89,8 @@ class Conf:
return parameters
def gen_workflow_job(self, phase):
# All jobs require the setup job
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase)
job_def["requires"] = ["setup"]
if phase == "test":
@ -105,16 +100,13 @@ class Conf:
# pytorch build job (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259452641)
dependency_build = self.parent_build or self
job_def["requires"].append(dependency_build.gen_build_name("build"))
job_def["requires"] = [dependency_build.gen_build_name("build")]
job_name = "pytorch_linux_test"
else:
job_name = "pytorch_linux_build"
if not self.is_important:
# If you update this, update
# caffe2_build_definitions.py too
job_def["filters"] = {"branches": {"only": ["master", r"/ci-all\/.*/"]}}
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}
@ -160,7 +152,13 @@ def gen_dependent_configs(xenial_parent_config):
configs.append(c)
for x in ["pytorch_python_doc_push", "pytorch_cpp_doc_push"]:
return configs
def gen_docs_configs(xenial_parent_config):
configs = []
for x in ["pytorch_python_doc_push", "pytorch_cpp_doc_push", "pytorch_doc_test"]:
configs.append(HiddenConf(x, parent_build=xenial_parent_config))
return configs
@ -182,15 +180,19 @@ def instantiate_configs():
root = get_root()
found_configs = conf_tree.dfs(root)
restrict_phases = None
for fc in found_configs:
restrict_phases = None
distro_name = fc.find_prop("distro_name")
compiler_name = fc.find_prop("compiler_name")
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
parms_list_ignored_for_docker_image = []
vulkan = fc.find_prop("vulkan") or False
if vulkan:
parms_list_ignored_for_docker_image.append("vulkan")
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
@ -210,25 +212,27 @@ def instantiate_configs():
android_abi = fc.find_prop("android_abi")
parms_list_ignored_for_docker_image.append(android_abi)
restrict_phases = ["build"]
fc.props["is_important"] = True
elif compiler_name:
gcc_version = compiler_name + (fc.find_prop("compiler_version") or "")
parms_list.append(gcc_version)
# TODO: This is a nasty special case
if compiler_name == "clang" and not is_xla:
if gcc_version == 'clang5' and not is_xla:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if cuda_version in ["9.2", "10", "10.1"]:
# TODO The gcc version is orthogonal to CUDA version?
parms_list.append("gcc7")
if cuda_version:
cuda_gcc_version = fc.find_prop("cuda_gcc_override") or "gcc7"
parms_list.append(cuda_gcc_version)
is_libtorch = fc.find_prop("is_libtorch") or False
is_important = fc.find_prop("is_important") or False
parallel_backend = fc.find_prop("parallel_backend") or None
build_only = fc.find_prop("build_only") or False
if build_only and restrict_phases is None:
restrict_phases = ["build"]
gpu_resource = None
if cuda_version and cuda_version != "10":
@ -241,6 +245,7 @@ def instantiate_configs():
python_version,
cuda_version,
is_xla,
vulkan,
restrict_phases,
gpu_resource,
is_libtorch=is_libtorch,
@ -248,7 +253,16 @@ def instantiate_configs():
parallel_backend=parallel_backend,
)
if cuda_version == "9" and python_version == "3.6" and not is_libtorch:
# run docs builds on "pytorch-linux-xenial-py3.6-gcc5.4". Docs builds
# should run on a CPU-only build that runs on all PRs.
if distro_name == 'xenial' and fc.find_prop("pyver") == '3.6' \
and cuda_version is None \
and parallel_backend is None \
and compiler_name == 'gcc' \
and fc.find_prop('compiler_version') == '5.4':
c.dependent_tests = gen_docs_configs(c)
if cuda_version == "10.1" and python_version == "3.6" and not is_libtorch:
c.dependent_tests = gen_dependent_configs(c)
if (compiler_name == "gcc"
@ -275,7 +289,7 @@ def get_workflow_jobs():
config_list = instantiate_configs()
x = ["setup"]
x = []
for conf_options in config_list:
phases = conf_options.restrict_phases or dimensions.PHASES


@ -0,0 +1,92 @@
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_NDK
class AndroidJob:
def __init__(self,
variant,
template_name,
is_master_only=True):
self.variant = variant
self.template_name = template_name
self.is_master_only = is_master_only
def gen_tree(self):
base_name_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"android",
"ndk",
"r19c",
] + self.variant + [
"build",
]
full_job_name = "_".join(base_name_parts)
build_env_name = "-".join(base_name_parts)
props_dict = {
"name": full_job_name,
"build_environment": "\"{}\"".format(build_env_name),
"docker_image": "\"{}\"".format(DOCKER_IMAGE_NDK),
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
return [{self.template_name: props_dict}]
class AndroidGradleJob:
def __init__(self,
job_name,
template_name,
dependencies,
is_master_only=True):
self.job_name = job_name
self.template_name = template_name
self.dependencies = dependencies
self.is_master_only = is_master_only
def gen_tree(self):
props_dict = {
"name": self.job_name,
"requires": self.dependencies,
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
return [{self.template_name: props_dict}]
WORKFLOW_DATA = [
AndroidJob(["x86_32"], "pytorch_linux_build", is_master_only=False),
AndroidJob(["x86_64"], "pytorch_linux_build"),
AndroidJob(["arm", "v7a"], "pytorch_linux_build"),
AndroidJob(["arm", "v8a"], "pytorch_linux_build"),
AndroidJob(["vulkan", "x86_32"], "pytorch_linux_build", is_master_only=False),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32",
"pytorch_android_gradle_build-x86_32",
["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build"],
is_master_only=False),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build",
"pytorch_android_gradle_build",
["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build",
"pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build",
"pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build",
"pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build"]),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -0,0 +1,63 @@
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_GCC7
def gen_job_name(phase):
job_name_parts = [
"pytorch",
"bazel",
phase,
]
return "_".join(job_name_parts)
class BazelJob:
def __init__(self, phase, extra_props=None):
self.phase = phase
self.extra_props = extra_props or {}
def gen_tree(self):
template_parts = [
"pytorch",
"linux",
"bazel",
self.phase,
]
build_env_parts = [
"pytorch",
"linux",
"xenial",
"py3.6",
"gcc7",
"bazel",
self.phase,
]
full_job_name = gen_job_name(self.phase)
build_env_name = "-".join(build_env_parts)
extra_requires = [gen_job_name("build")] if self.phase == "test" else []
props_dict = {
"build_environment": build_env_name,
"docker_image": DOCKER_IMAGE_GCC7,
"name": full_job_name,
"requires": extra_requires,
}
props_dict.update(self.extra_props)
template_name = "_".join(template_parts)
return [{template_name: props_dict}]
WORKFLOW_DATA = [
BazelJob("build", {"resource_class": "large"}),
BazelJob("test"),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
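For illustration, here is roughly what the two BazelJob entries above render to. The dict shapes are inferred from gen_tree, and the docker image path follows DOCKER_IMAGE_GCC7 as defined in docker_constants.py later in this diff; treat this as a sketch, not generated output:

# Approximate gen_tree() results for BazelJob("build", ...) and BazelJob("test").
bazel_build = {"pytorch_linux_bazel_build": {
    "build_environment": "pytorch-linux-xenial-py3.6-gcc7-bazel-build",
    "docker_image": "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/"
                    "pytorch-linux-xenial-py3.6-gcc7:209062ef-ab58-422a-b295-36c4eed6e906",
    "name": "pytorch_bazel_build",
    "requires": [],                   # build has no prerequisites
    "resource_class": "large",        # from extra_props
}}
bazel_test = {"pytorch_linux_bazel_test": {
    "build_environment": "pytorch-linux-xenial-py3.6-gcc7-bazel-test",
    "docker_image": bazel_build["pytorch_linux_bazel_build"]["docker_image"],
    "name": "pytorch_bazel_test",
    "requires": ["pytorch_bazel_build"],  # test depends on the build job
}}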


@ -0,0 +1,193 @@
"""
TODO: Refactor circleci/cimodel/data/binary_build_data.py to generate this file
instead of doing one-offs here
Binary builds (subset, to smoke test that they'll work)
NB: If you modify this file, you need to also modify
the binary_and_smoke_tests_on_pr variable in
pytorch-ci-hud to adjust the list of whitelisted builds
at https://github.com/ezyang/pytorch-ci-hud/blob/master/src/BuildHistoryDisplay.js
Note:
This binary build is currently broken, see https://github.com/pytorch/pytorch/issues/16710
- binary_linux_conda_3_6_cu90_devtoolset7_build
- binary_linux_conda_3_6_cu90_devtoolset7_test
TODO
we should test a libtorch cuda build, but they take too long
- binary_linux_libtorch_3_6m_cu90_devtoolset7_static-without-deps_build
"""
import cimodel.lib.miniutils as miniutils
import cimodel.data.simple.util.branch_filters
class SmoketestJob:
def __init__(self,
template_name,
build_env_parts,
docker_image,
job_name,
is_master_only=False,
requires=None,
has_libtorch_variant=False,
extra_props=None):
self.template_name = template_name
self.build_env_parts = build_env_parts
self.docker_image = docker_image
self.job_name = job_name
self.is_master_only = is_master_only
self.requires = requires or []
self.has_libtorch_variant = has_libtorch_variant
self.extra_props = extra_props or {}
def gen_tree(self):
props_dict = {
"build_environment": " ".join(self.build_env_parts),
"name": self.job_name,
"requires": self.requires,
}
if self.docker_image:
props_dict["docker_image"] = self.docker_image
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
if self.has_libtorch_variant:
props_dict["libtorch_variant"] = "shared-with-deps"
props_dict.update(self.extra_props)
return [{self.template_name: props_dict}]
WORKFLOW_DATA = [
SmoketestJob(
"binary_linux_build",
["manywheel", "3.7m", "cu102", "devtoolset7"],
"pytorch/manylinux-cuda102",
"binary_linux_manywheel_3_7m_cu102_devtoolset7_build",
is_master_only=True,
),
SmoketestJob(
"binary_linux_build",
["libtorch", "3.7m", "cpu", "devtoolset7"],
"pytorch/manylinux-cuda102",
"binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build",
is_master_only=False,
has_libtorch_variant=True,
),
SmoketestJob(
"binary_linux_build",
["libtorch", "3.7m", "cpu", "gcc5.4_cxx11-abi"],
"pytorch/pytorch-binary-docker-image-ubuntu16.04:latest",
"binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build",
is_master_only=False,
has_libtorch_variant=True,
),
SmoketestJob(
"binary_mac_build",
["wheel", "3.7", "cpu"],
None,
"binary_macos_wheel_3_7_cpu_build",
is_master_only=True,
),
# This job has an average run time of 3 hours o.O
# Now only running this on master to reduce overhead
SmoketestJob(
"binary_mac_build",
["libtorch", "3.7", "cpu"],
None,
"binary_macos_libtorch_3_7_cpu_build",
is_master_only=True,
),
SmoketestJob(
"binary_windows_build",
["libtorch", "3.7", "cpu", "debug"],
None,
"binary_windows_libtorch_3_7_cpu_debug_build",
is_master_only=False,
),
SmoketestJob(
"binary_windows_build",
["libtorch", "3.7", "cpu", "release"],
None,
"binary_windows_libtorch_3_7_cpu_release_build",
is_master_only=False,
),
SmoketestJob(
"binary_windows_build",
["wheel", "3.7", "cu102"],
None,
"binary_windows_wheel_3_7_cu102_build",
is_master_only=True,
),
SmoketestJob(
"binary_windows_test",
["libtorch", "3.7", "cpu", "debug"],
None,
"binary_windows_libtorch_3_7_cpu_debug_test",
is_master_only=False,
requires=["binary_windows_libtorch_3_7_cpu_debug_build"],
),
SmoketestJob(
"binary_windows_test",
["libtorch", "3.7", "cpu", "release"],
None,
"binary_windows_libtorch_3_7_cpu_release_test",
is_master_only=False,
requires=["binary_windows_libtorch_3_7_cpu_release_build"],
),
SmoketestJob(
"binary_windows_test",
["wheel", "3.7", "cu102"],
None,
"binary_windows_wheel_3_7_cu102_test",
is_master_only=True,
requires=["binary_windows_wheel_3_7_cu102_build"],
extra_props={
"executor": "windows-with-nvidia-gpu",
},
),
SmoketestJob(
"binary_linux_test",
["manywheel", "3.7m", "cu102", "devtoolset7"],
"pytorch/manylinux-cuda102",
"binary_linux_manywheel_3_7m_cu102_devtoolset7_test",
is_master_only=True,
requires=["binary_linux_manywheel_3_7m_cu102_devtoolset7_build"],
extra_props={
"resource_class": "gpu.medium",
"use_cuda_docker_runtime": miniutils.quote((str(1))),
},
),
SmoketestJob(
"binary_linux_test",
["libtorch", "3.7m", "cpu", "devtoolset7"],
"pytorch/manylinux-cuda102",
"binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test",
is_master_only=False,
requires=["binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build"],
has_libtorch_variant=True,
),
SmoketestJob(
"binary_linux_test",
["libtorch", "3.7m", "cpu", "gcc5.4_cxx11-abi"],
"pytorch/pytorch-binary-docker-image-ubuntu16.04:latest",
"binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test",
is_master_only=False,
requires=["binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build"],
has_libtorch_variant=True,
),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -0,0 +1,44 @@
from collections import OrderedDict
from cimodel.lib.miniutils import quote
# TODO: make this generated from a matrix rather than just a static list
IMAGE_NAMES = [
"pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9",
"pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9",
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9",
"pytorch-linux-bionic-py3.6-clang9",
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9",
"pytorch-linux-bionic-py3.8-gcc9",
"pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7",
"pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4",
"pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
"pytorch-linux-xenial-py3-clang5-asan",
"pytorch-linux-xenial-py3.8",
"pytorch-linux-xenial-py3.6-clang7",
"pytorch-linux-xenial-py3.6-gcc4.8",
"pytorch-linux-xenial-py3.6-gcc5.4",
"pytorch-linux-xenial-py3.6-gcc7.2",
"pytorch-linux-xenial-py3.6-gcc7",
"pytorch-linux-xenial-pynightly",
"pytorch-linux-xenial-rocm3.3-py3.6",
]
def get_workflow_jobs():
"""Generates a list of docker image build definitions"""
return [
OrderedDict(
{
"docker_build_job": OrderedDict(
{"name": quote(image_name), "image_name": quote(image_name)}
)
}
)
for image_name in IMAGE_NAMES
]
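Each entry in IMAGE_NAMES thus becomes one docker_build_job. Assuming miniutils.quote simply wraps its argument in double quotes (consistent with its use in miniyaml.py below), the first entry renders as the following fragment:

# Sketch: workflow fragment produced for the first image name.
fragment = {
    "docker_build_job": {
        "name": '"pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9"',
        "image_name": '"pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9"',
    }
}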


@ -0,0 +1,103 @@
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.versions import MultiPartVersion, CudaVersion
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_BASIC, DOCKER_IMAGE_CUDA_10_2
class GeConfigTestJob:
def __init__(self,
py_version,
gcc_version,
cuda_version,
variant_parts,
extra_requires,
use_cuda_docker=False,
build_env_override=None):
self.py_version = py_version
self.gcc_version = gcc_version
self.cuda_version = cuda_version
self.variant_parts = variant_parts
self.extra_requires = extra_requires
self.use_cuda_docker = use_cuda_docker
self.build_env_override = build_env_override
def get_all_parts(self, with_dots):
maybe_py_version = self.py_version.render_dots_or_parts(with_dots) if self.py_version else []
maybe_gcc_version = self.gcc_version.render_dots_or_parts(with_dots) if self.gcc_version else []
maybe_cuda_version = self.cuda_version.render_dots_or_parts(with_dots) if self.cuda_version else []
common_parts = [
"pytorch",
"linux",
"xenial",
] + maybe_cuda_version + maybe_py_version + maybe_gcc_version
return common_parts + self.variant_parts
def gen_tree(self):
resource_class = "gpu.medium" if self.use_cuda_docker else "large"
docker_image = DOCKER_IMAGE_CUDA_10_2 if self.use_cuda_docker else DOCKER_IMAGE_BASIC
full_name = "_".join(self.get_all_parts(False))
build_env = self.build_env_override or "-".join(self.get_all_parts(True))
props_dict = {
"name": full_name,
"build_environment": build_env,
"requires": self.extra_requires,
"resource_class": resource_class,
"docker_image": docker_image,
}
if self.use_cuda_docker:
props_dict["use_cuda_docker_runtime"] = miniutils.quote(str(1))
return [{"pytorch_linux_test": props_dict}]
WORKFLOW_DATA = [
GeConfigTestJob(
MultiPartVersion([3, 6], "py"),
MultiPartVersion([5, 4], "gcc"),
None,
["ge_config_legacy", "test"],
["pytorch_linux_xenial_py3_6_gcc5_4_build"]),
GeConfigTestJob(
MultiPartVersion([3, 6], "py"),
MultiPartVersion([5, 4], "gcc"),
None,
["ge_config_profiling", "test"],
["pytorch_linux_xenial_py3_6_gcc5_4_build"]),
GeConfigTestJob(
MultiPartVersion([3, 6], "py"),
MultiPartVersion([5, 4], "gcc"),
None,
["ge_config_simple", "test"],
["pytorch_linux_xenial_py3_6_gcc5_4_build"],
),
GeConfigTestJob(
None,
None,
CudaVersion(10, 2),
["cudnn7", "py3", "ge_config_legacy", "test"],
["pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build"],
use_cuda_docker=True,
# TODO Why does the build environment specify cuda10.1, while the
# job name is cuda10_2?
build_env_override="pytorch-linux-xenial-cuda10.1-cudnn7-ge_config_legacy-test"),
GeConfigTestJob(
None,
None,
CudaVersion(10, 2),
["cudnn7", "py3", "ge_config_profiling", "test"],
["pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build"],
use_cuda_docker=True,
# TODO Why does the build environment specify cuda10.1, while the
# job name is cuda10_2?
build_env_override="pytorch-linux-xenial-cuda10.1-cudnn7-ge_config_profiling-test"),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
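Note the dual rendering in get_all_parts: with_dots=False yields the underscore-joined job name, while with_dots=True yields the dash-joined build environment (see MultiPartVersion in versions.py later in this diff). A sketch of both renderings for the first entry:

# The two renderings of the first GeConfigTestJob above.
name_parts = ["pytorch", "linux", "xenial", "py3", "6", "gcc5", "4",
              "ge_config_legacy", "test"]           # with_dots=False
env_parts = ["pytorch", "linux", "xenial", "py3.6", "gcc5.4",
             "ge_config_legacy", "test"]            # with_dots=True
print("_".join(name_parts))  # pytorch_linux_xenial_py3_6_gcc5_4_ge_config_legacy_test
print("-".join(env_parts))   # pytorch-linux-xenial-py3.6-gcc5.4-ge_config_legacy-test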


@ -0,0 +1,71 @@
from cimodel.data.simple.util.versions import MultiPartVersion
IOS_VERSION = MultiPartVersion([11, 2, 1])
class ArchVariant:
def __init__(self, name, is_custom=False):
self.name = name
self.is_custom = is_custom
def render(self):
extra_parts = ["custom"] if self.is_custom else []
return "_".join([self.name] + extra_parts)
def get_platform(arch_variant_name):
return "SIMULATOR" if arch_variant_name == "x86_64" else "OS"
class IOSJob:
def __init__(self, ios_version, arch_variant, is_org_member_context=True, extra_props=None):
self.ios_version = ios_version
self.arch_variant = arch_variant
self.is_org_member_context = is_org_member_context
self.extra_props = extra_props
def gen_name_parts(self, with_version_dots):
version_parts = self.ios_version.render_dots_or_parts(with_version_dots)
build_variant_suffix = "_".join([self.arch_variant.render(), "build"])
return [
"pytorch",
"ios",
] + version_parts + [
build_variant_suffix,
]
def gen_job_name(self):
return "_".join(self.gen_name_parts(False))
def gen_tree(self):
platform_name = get_platform(self.arch_variant.name)
props_dict = {
"build_environment": "-".join(self.gen_name_parts(True)),
"ios_arch": self.arch_variant.name,
"ios_platform": platform_name,
"name": self.gen_job_name(),
}
if self.is_org_member_context:
props_dict["context"] = "org-member"
if self.extra_props:
props_dict.update(self.extra_props)
return [{"pytorch_ios_build": props_dict}]
WORKFLOW_DATA = [
IOSJob(IOS_VERSION, ArchVariant("x86_64"), is_org_member_context=False),
IOSJob(IOS_VERSION, ArchVariant("arm64")),
IOSJob(IOS_VERSION, ArchVariant("arm64", True), extra_props={"op_list": "mobilenetv2.yaml"}),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
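For example, the x86_64 simulator entry should produce the following job name and build environment (a sketch; version rendering follows MultiPartVersion in versions.py later in this diff):

# IOSJob(IOS_VERSION, ArchVariant("x86_64")) naming.
job_name = "_".join(["pytorch", "ios", "11", "2", "1", "x86_64_build"])
build_env = "-".join(["pytorch", "ios", "11.2.1", "x86_64_build"])
print(job_name)   # pytorch_ios_11_2_1_x86_64_build
print(build_env)  # pytorch-ios-11.2.1-x86_64_build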


@ -0,0 +1,28 @@
class MacOsJob:
def __init__(self, os_version, is_test=False):
self.os_version = os_version
self.is_test = is_test
def gen_tree(self):
non_phase_parts = ["pytorch", "macos", self.os_version, "py3"]
phase_name = "test" if self.is_test else "build"
full_job_name = "_".join(non_phase_parts + [phase_name])
test_build_dependency = "_".join(non_phase_parts + ["build"])
extra_dependencies = [test_build_dependency] if self.is_test else []
job_dependencies = extra_dependencies
# Yes, we name the job after itself; it needs a non-empty value here
# for the YAML output to work.
props_dict = {"requires": job_dependencies, "name": full_job_name}
return [{full_job_name: props_dict}]
WORKFLOW_DATA = [MacOsJob("10_13"), MacOsJob("10_13", True)]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -0,0 +1,56 @@
"""
PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
"""
import cimodel.lib.miniutils as miniutils
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_ASAN, DOCKER_IMAGE_NDK
class MobileJob:
def __init__(self, docker_image, variant_parts, is_master_only=False):
self.docker_image = docker_image
self.variant_parts = variant_parts
self.is_master_only = is_master_only
def gen_tree(self):
non_phase_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"mobile",
] + self.variant_parts
full_job_name = "_".join(non_phase_parts)
build_env_name = "-".join(non_phase_parts)
props_dict = {
"build_environment": build_env_name,
"build_only": miniutils.quote(str(int(True))),
"docker_image": self.docker_image,
"name": full_job_name,
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
return [{"pytorch_linux_build": props_dict}]
WORKFLOW_DATA = [
MobileJob(DOCKER_IMAGE_ASAN, ["build"]),
MobileJob(DOCKER_IMAGE_ASAN, ["custom", "build", "static"]),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
MobileJob(DOCKER_IMAGE_NDK, ["custom", "build", "dynamic"]),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
# Most of this CI is already covered by "mobile-custom-build-dynamic" job
MobileJob(DOCKER_IMAGE_NDK, ["code", "analysis"], True),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -0,0 +1,73 @@
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_NDK
class AndroidNightlyJob:
def __init__(self,
variant,
template_name,
extra_props=None,
with_docker=True,
requires=None,
no_build_suffix=False):
self.variant = variant
self.template_name = template_name
self.extra_props = extra_props or {}
self.with_docker = with_docker
self.requires = requires
self.no_build_suffix = no_build_suffix
def gen_tree(self):
base_name_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"android",
"ndk",
"r19c",
] + self.variant
build_suffix = [] if self.no_build_suffix else ["build"]
full_job_name = "_".join(["nightly"] + base_name_parts + build_suffix)
build_env_name = "-".join(base_name_parts)
props_dict = {
"name": full_job_name,
"requires": self.requires,
"filters": {"branches": {"only": "nightly"}},
}
props_dict.update(self.extra_props)
if self.with_docker:
props_dict["docker_image"] = DOCKER_IMAGE_NDK
props_dict["build_environment"] = build_env_name
return [{self.template_name: props_dict}]
WORKFLOW_DATA = [
AndroidNightlyJob(["x86_32"], "pytorch_linux_build"),
AndroidNightlyJob(["x86_64"], "pytorch_linux_build"),
AndroidNightlyJob(["arm", "v7a"], "pytorch_linux_build"),
AndroidNightlyJob(["arm", "v8a"], "pytorch_linux_build"),
AndroidNightlyJob(["android_gradle"], "pytorch_android_gradle_build",
with_docker=False,
requires=[
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build",
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build",
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build",
"nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build"]),
AndroidNightlyJob(["x86_32_android_publish_snapshot"], "pytorch_android_publish_snapshot",
extra_props={"context": "org-member"},
with_docker=False,
requires=["nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_android_gradle_build"],
no_build_suffix=True),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -0,0 +1,68 @@
import cimodel.data.simple.ios_definitions as ios_definitions
class IOSNightlyJob:
def __init__(self,
variant,
is_upload=False):
self.variant = variant
self.is_upload = is_upload
def get_phase_name(self):
return "upload" if self.is_upload else "build"
def get_common_name_pieces(self, with_version_dots):
extra_name_suffix = [self.get_phase_name()] if self.is_upload else []
common_name_pieces = [
"ios",
] + ios_definitions.IOS_VERSION.render_dots_or_parts(with_version_dots) + [
"nightly",
self.variant,
"build",
] + extra_name_suffix
return common_name_pieces
def gen_job_name(self):
return "_".join(["pytorch"] + self.get_common_name_pieces(False))
def gen_tree(self):
extra_requires = [x.gen_job_name() for x in BUILD_CONFIGS] if self.is_upload else []
props_dict = {
"build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(True)),
"requires": extra_requires,
"context": "org-member",
"filters": {"branches": {"only": "nightly"}},
}
if not self.is_upload:
props_dict["ios_arch"] = self.variant
props_dict["ios_platform"] = ios_definitions.get_platform(self.variant)
props_dict["name"] = self.gen_job_name()
template_name = "_".join([
"binary",
"ios",
self.get_phase_name(),
])
return [{template_name: props_dict}]
BUILD_CONFIGS = [
IOSNightlyJob("x86_64"),
IOSNightlyJob("arm64"),
]
WORKFLOW_DATA = BUILD_CONFIGS + [
IOSNightlyJob("binary", is_upload=True),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]


@ -0,0 +1,22 @@
NON_PR_BRANCH_LIST = [
"master",
r"/ci-all\/.*/",
r"/release\/.*/",
]
RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"
def gen_filter_dict(
branches_list=NON_PR_BRANCH_LIST,
tags_list=None
):
"""Generates a filter dictionary for use with CircleCI's job filter"""
filter_dict = {
"branches": {
"only": branches_list,
},
}
if tags_list is not None:
filter_dict["tags"] = {"only": tags_list}
return filter_dict
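A usage sketch of the defaults:

# Default result of gen_filter_dict(): restrict a job to non-PR branches.
default_filters = {
    "branches": {
        "only": ["master", r"/ci-all\/.*/", r"/release\/.*/"],
    },
}
# gen_filter_dict(tags_list=[RC_PATTERN]) would additionally include
# {"tags": {"only": [RC_PATTERN]}} so release-candidate tags also trigger the job.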


@ -0,0 +1,30 @@
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
# ARE YOU EDITING THIS NUMBER? MAKE SURE YOU READ THE GUIDANCE AT THE
# TOP OF .circleci/config.yml
DOCKER_IMAGE_TAG = "209062ef-ab58-422a-b295-36c4eed6e906"
def gen_docker_image_path(container_type):
return "/".join([
AWS_DOCKER_HOST,
"pytorch",
container_type + ":" + DOCKER_IMAGE_TAG,
])
DOCKER_IMAGE_BASIC = gen_docker_image_path("pytorch-linux-xenial-py3.6-gcc5.4")
DOCKER_IMAGE_CUDA_10_2 = gen_docker_image_path("pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7")
DOCKER_IMAGE_GCC7 = gen_docker_image_path("pytorch-linux-xenial-py3.6-gcc7")
def gen_mobile_docker_name(specifier):
container_type = "pytorch-linux-xenial-py3-clang5-" + specifier
return gen_docker_image_path(container_type)
DOCKER_IMAGE_ASAN = gen_mobile_docker_name("asan")
DOCKER_IMAGE_NDK = gen_mobile_docker_name("android-ndk-r19c")
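So, for example, DOCKER_IMAGE_BASIC resolves to the full ECR path composed below:

# Worked example: composing DOCKER_IMAGE_BASIC by hand from the constants above.
host = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
tag = "209062ef-ab58-422a-b295-36c4eed6e906"
path = "/".join([host, "pytorch", "pytorch-linux-xenial-py3.6-gcc5.4" + ":" + tag])
print(path)
# 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906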


@ -0,0 +1,31 @@
class MultiPartVersion:
def __init__(self, parts, prefix=""):
self.parts = parts
self.prefix = prefix
def prefixed_parts(self):
"""
Prepends the first element of the version list
with the prefix string.
"""
if self.parts:
return [self.prefix + str(self.parts[0])] + list(map(str, self.parts[1:]))
else:
return [self.prefix]
def render_dots(self):
return ".".join(self.prefixed_parts())
def render_dots_or_parts(self, with_dots):
if with_dots:
return [self.render_dots()]
else:
return self.prefixed_parts()
class CudaVersion(MultiPartVersion):
def __init__(self, major, minor):
self.major = major
self.minor = minor
super().__init__([self.major, self.minor], "cuda")
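The same version therefore appears as "cuda10_2" in job names and "cuda10.2" in build environment names; a sketch:

# CudaVersion(10, 2) under both rendering modes.
parts_form = ["cuda10", "2"]    # render_dots_or_parts(False): prefixed parts
dotted_form = ["cuda10.2"]      # render_dots_or_parts(True): single dotted string
print("_".join(parts_form))     # cuda10_2  (used in job names)
print("-".join(dotted_form))    # cuda10.2  (used in build environment names)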


@ -0,0 +1,142 @@
import cimodel.data.simple.util.branch_filters
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.versions import CudaVersion
class WindowsJob:
def __init__(
self,
test_index,
vscode_spec,
cuda_version,
force_on_cpu=False,
master_only_pred=lambda job: job.vscode_spec.year != 2019,
):
self.test_index = test_index
self.vscode_spec = vscode_spec
self.cuda_version = cuda_version
self.force_on_cpu = force_on_cpu
self.master_only_pred = master_only_pred
def gen_tree(self):
base_phase = "build" if self.test_index is None else "test"
numbered_phase = (
base_phase if self.test_index is None else base_phase + str(self.test_index)
)
key_name = "_".join(["pytorch", "windows", base_phase])
cpu_forcing_name_parts = ["on", "cpu"] if self.force_on_cpu else []
target_arch = self.cuda_version.render_dots() if self.cuda_version else "cpu"
base_name_parts = [
"pytorch",
"windows",
self.vscode_spec.render(),
"py36",
target_arch,
]
prerequisite_jobs = []
if base_phase == "test":
prerequisite_jobs.append("_".join(base_name_parts + ["build"]))
arch_env_elements = (
["cuda" + str(self.cuda_version.major), "cudnn7"]
if self.cuda_version
else ["cpu"]
)
build_environment_string = "-".join(
["pytorch", "win"]
+ self.vscode_spec.get_elements()
+ arch_env_elements
+ ["py3"]
)
is_running_on_cuda = bool(self.cuda_version) and not self.force_on_cpu
props_dict = {
"build_environment": build_environment_string,
"python_version": miniutils.quote("3.6"),
"vc_version": miniutils.quote(self.vscode_spec.dotted_version()),
"vc_year": miniutils.quote(str(self.vscode_spec.year)),
"vc_product": self.vscode_spec.get_product(),
"use_cuda": miniutils.quote(str(int(is_running_on_cuda))),
"requires": prerequisite_jobs,
}
if self.master_only_pred(self):
props_dict[
"filters"
] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
name_parts = base_name_parts + cpu_forcing_name_parts + [numbered_phase]
if base_phase == "test":
test_name = "-".join(["pytorch", "windows", numbered_phase])
props_dict["test_name"] = test_name
if is_running_on_cuda:
props_dict["executor"] = "windows-with-nvidia-gpu"
props_dict["cuda_version"] = (
miniutils.quote(str(self.cuda_version.major))
if self.cuda_version
else "cpu"
)
props_dict["name"] = "_".join(name_parts)
return [{key_name: props_dict}]
class VcSpec:
def __init__(self, year, version_elements=None):
self.year = year
self.version_elements = version_elements or []
def get_elements(self):
return [self.prefixed_year()] + self.version_elements
def get_product(self):
return "Community" if self.year == 2019 else "BuildTools"
def dotted_version(self):
return ".".join(self.version_elements)
def prefixed_year(self):
return "vs" + str(self.year)
def render(self):
return "_".join(filter(None, [self.prefixed_year(), self.dotted_version()]))
def FalsePred(_):
return False
def TruePred(_):
return True
WORKFLOW_DATA = [
# VS2017 CUDA-10.1
WindowsJob(None, VcSpec(2017, ["14", "11"]), CudaVersion(10, 1), master_only_pred=FalsePred),
WindowsJob(1, VcSpec(2017, ["14", "11"]), CudaVersion(10, 1)),
# VS2017 no-CUDA (builds only)
WindowsJob(None, VcSpec(2017, ["14", "16"]), CudaVersion(10, 1)),
WindowsJob(None, VcSpec(2017, ["14", "16"]), None),
# VS2019 CUDA-10.1
WindowsJob(None, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(1, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(2, VcSpec(2019), CudaVersion(10, 1)),
# VS2019 CPU-only
WindowsJob(None, VcSpec(2019), None),
WindowsJob(1, VcSpec(2019), None),
WindowsJob(2, VcSpec(2019), None, master_only_pred=TruePred),
WindowsJob(1, VcSpec(2019), CudaVersion(10, 1), force_on_cpu=True),
WindowsJob(2, VcSpec(2019), CudaVersion(10, 1), force_on_cpu=True, master_only_pred=TruePred),
]
def get_windows_workflows():
return [item.gen_tree() for item in WORKFLOW_DATA]
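The master_only_pred plumbing is easy to misread: the default predicate makes VS2017 jobs master-only while VS2019 jobs run on every PR, and FalsePred/TruePred override that per job. A minimal sketch of the predicate:

def default_master_only(job_year):
    # mirrors the default master_only_pred above: master-only unless VS2019
    return job_year != 2019

assert default_master_only(2017)      # VS2017 jobs get branch filters by default
assert not default_master_only(2019)  # VS2019 jobs run on all PRs unless TruePred is passed
# FalsePred forces a job onto every PR regardless of year; TruePred restricts it to master.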


@ -1,5 +1,7 @@
from collections import OrderedDict
import cimodel.lib.miniutils as miniutils
LIST_MARKER = "- "
INDENTATION_WIDTH = 2
@ -29,7 +31,8 @@ def render(fh, data, depth, is_list_member=False):
tuples.sort()
for i, (k, v) in enumerate(tuples):
if not v:
continue
# If this dict is itself a list member, the first key gets prefixed with a list marker
list_marker_prefix = LIST_MARKER if is_list_member and not i else ""
@ -43,5 +46,7 @@ def render(fh, data, depth, is_list_member=False):
render(fh, v, depth, True)
else:
# use empty quotes to denote an empty string value instead of blank space
modified_data = miniutils.quote(data) if data == "" else data
list_member_prefix = indentation + LIST_MARKER if is_list_member else ""
fh.write(list_member_prefix + str(data) + "\n")
fh.write(list_member_prefix + str(modified_data) + "\n")
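The empty-string special case matters because writing a bare "- " would drop the value entirely. Assuming miniutils.quote wraps its argument in double quotes (an assumption consistent with its use elsewhere in this diff), the fix renders an explicit empty scalar:

# Sketch of the scalar path above, with an assumed quote() implementation.
def quote(s):
    return '"' + s + '"'  # assumed behavior of cimodel.lib.miniutils.quote

data = ""
modified_data = quote(data) if data == "" else data
print("- " + str(modified_data))  # emits: - ""   (instead of a bare "- ")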


@ -1,84 +0,0 @@
"""
This module encapsulates dependencies on pygraphviz
"""
import colorsys
import cimodel.lib.conf_tree as conf_tree
def rgb2hex(rgb_tuple):
def to_hex(f):
return "%02x" % int(f * 255)
return "#" + "".join(map(to_hex, list(rgb_tuple)))
def handle_missing_graphviz(f):
"""
If the user has not installed pygraphviz, this causes
calls to the draw() method of the returned object to do nothing.
"""
try:
import pygraphviz # noqa: F401
return f
except ModuleNotFoundError:
class FakeGraph:
def draw(self, *args, **kwargs):
pass
return lambda _: FakeGraph()
@handle_missing_graphviz
def generate_graph(toplevel_config_node):
"""
Traverses the graph once first just to find the max depth
"""
config_list = conf_tree.dfs(toplevel_config_node)
max_depth = 0
for config in config_list:
max_depth = max(max_depth, config.get_depth())
# color the nodes using the max depth
from pygraphviz import AGraph
dot = AGraph()
def node_discovery_callback(node, sibling_index, sibling_count):
depth = node.get_depth()
sat_min, sat_max = 0.1, 0.6
sat_range = sat_max - sat_min
saturation_fraction = sibling_index / float(sibling_count - 1) if sibling_count > 1 else 1
saturation = sat_min + sat_range * saturation_fraction
# TODO Use a hash of the node label to determine the color
hue = depth / float(max_depth + 1)
rgb_tuple = colorsys.hsv_to_rgb(hue, saturation, 1)
this_node_key = node.get_node_key()
dot.add_node(
this_node_key,
label=node.get_label(),
style="filled",
# fillcolor=hex_color + ":orange",
fillcolor=rgb2hex(rgb_tuple),
penwidth=3,
color=rgb2hex(colorsys.hsv_to_rgb(hue, saturation, 0.9))
)
def child_callback(node, child):
this_node_key = node.get_node_key()
child_node_key = child.get_node_key()
dot.add_edge((this_node_key, child_node_key))
conf_tree.dfs_recurse(toplevel_config_node, lambda x: None, node_discovery_callback, child_callback)
return dot


@ -0,0 +1,17 @@
#!/bin/bash -xe
YAML_FILENAME=verbatim-sources/workflows-pytorch-ge-config-tests.yml
DIFF_TOOL=meld
# Allows this script to be invoked from any directory:
cd $(dirname "$0")
pushd ..
$DIFF_TOOL $YAML_FILENAME <(./codegen_validation/normalize_yaml_fragment.py < $YAML_FILENAME)
popd


@ -0,0 +1,24 @@
#!/usr/bin/env python3
import os
import sys
import yaml
# Need to import modules that lie on an upward-relative path
sys.path.append(os.path.join(sys.path[0], '..'))
import cimodel.lib.miniyaml as miniyaml
def regurgitate(depth, use_pyyaml_formatter=False):
data = yaml.safe_load(sys.stdin)
if use_pyyaml_formatter:
output = yaml.dump(data, sort_keys=True)
sys.stdout.write(output)
else:
miniyaml.render(sys.stdout, data, depth)
if __name__ == "__main__":
regurgitate(3)


@ -0,0 +1,15 @@
#!/bin/bash -xe
YAML_FILENAME=$1
# Allows this script to be invoked from any directory:
cd $(dirname "$0")
pushd ..
TEMP_FILENAME=$(mktemp)
cat $YAML_FILENAME | ./codegen_validation/normalize_yaml_fragment.py > $TEMP_FILENAME
mv $TEMP_FILENAME $YAML_FILENAME
popd

File diff suppressed because it is too large.


@ -15,6 +15,8 @@ OS="ubuntu"
DOCKERFILE="${OS}/Dockerfile"
if [[ "$image" == *-cuda* ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *-rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
fi
if [[ "$image" == *-trusty* ]]; then
@ -25,32 +27,20 @@ elif [[ "$image" == *-artful* ]]; then
UBUNTU_VERSION=17.10
elif [[ "$image" == *-bionic* ]]; then
UBUNTU_VERSION=18.04
elif [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04
fi
TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64"
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-bionic-clang9-thrift-llvmdev)
CLANG_VERSION=9
THRIFT=yes
LLVMDEV=yes
PROTOBUF=yes
;;
pytorch-linux-xenial-py2.7.9)
TRAVIS_PYTHON_VERSION=2.7.9
GCC_VERSION=7
# Do not install PROTOBUF, DB, and VISION as a test
;;
pytorch-linux-xenial-py2.7)
TRAVIS_PYTHON_VERSION=2.7
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3.5)
TRAVIS_PYTHON_VERSION=3.5
pytorch-linux-xenial-py3.8)
# TODO: This is a hack, get rid of this as soon as you get rid of the travis downloads
TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/16.04/x86_64"
TRAVIS_PYTHON_VERSION=3.8
GCC_VERSION=7
# Do not install PROTOBUF, DB, and VISION as a test
;;
@ -67,6 +57,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-py3.6-gcc7.2)
ANACONDA_PYTHON_VERSION=3.6
@ -87,39 +78,15 @@ case "$image" in
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda8-cudnn7-py2)
CUDA_VERSION=8.0
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=2.7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda8-cudnn7-py3)
CUDA_VERSION=8.0
pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4)
CUDA_VERSION=9.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=5
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda9-cudnn7-py2)
CUDA_VERSION=9.0
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=2.7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda9-cudnn7-py3)
CUDA_VERSION=9.0
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7)
CUDA_VERSION=9.2
CUDNN_VERSION=7
@ -146,6 +113,28 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7)
UBUNTU_VERSION=16.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-py3-clang5-asan)
ANACONDA_PYTHON_VERSION=3.6
@ -157,6 +146,7 @@ case "$image" in
pytorch-linux-xenial-py3-clang5-android-ndk-r19c)
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=5.0
LLVMDEV=yes
PROTOBUF=yes
ANDROID=yes
ANDROID_NDK_VERSION=r19c
@ -171,6 +161,76 @@ case "$image" in
DB=yes
VISION=yes
;;
pytorch-linux-bionic-py3.6-clang9)
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-bionic-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9)
UBUNTU_VERSION=18.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9)
UBUNTU_VERSION=18.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-rocm3.3-py3.6)
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=3.3
# newer cmake version required
CMAKE_VERSION=3.6.3
;;
pytorch-linux-bionic-rocm3.3-py3.6)
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=3.3
;;
esac
# Set Jenkins UID and GID if running Jenkins
@ -182,8 +242,12 @@ fi
tmp_tag="tmp-$(cat /dev/urandom | tr -dc 'a-z' | fold -w 32 | head -n 1)"
# Build image
# TODO: build-arg THRIFT is not turned on for any image, remove it once we confirm
# it's no longer needed.
docker build \
--no-cache \
--progress=plain \
--build-arg "TRAVIS_DL_URL_PREFIX=${TRAVIS_DL_URL_PREFIX}" \
--build-arg "BUILD_ENVIRONMENT=${image}" \
--build-arg "PROTOBUF=${PROTOBUF:-}" \
--build-arg "THRIFT=${THRIFT:-}" \
@ -207,6 +271,7 @@ docker build \
--build-arg "CMAKE_VERSION=${CMAKE_VERSION:-}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \


@ -45,5 +45,9 @@ trap "docker logout ${registry}" EXIT
docker push "${image}:${tag}"
# TODO: Get rid of duplicate tagging once ${DOCKER_TAG} becomes the default
docker tag "${image}:${tag}" "${image}:${DOCKER_TAG}"
docker push "${image}:${DOCKER_TAG}"
docker save -o "${IMAGE_NAME}:${tag}.tar" "${image}:${tag}"
aws s3 cp "${IMAGE_NAME}:${tag}.tar" "s3://ossci-linux-build/pytorch/base/${IMAGE_NAME}:${tag}.tar" --acl public-read


@ -10,7 +10,7 @@ apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
pushd /tmp
curl -Os https://dl.google.com/android/repository/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
curl -Os --retry 3 https://dl.google.com/android/repository/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
popd
_ndk_dir=/opt/ndk
mkdir -p "$_ndk_dir"


@ -2,17 +2,16 @@
set -ex
if [[ "$UBUNTU_VERSION" == "14.04" ]]; then
# cmake 2 is too old
cmake3=cmake3
else
cmake3=cmake
fi
if [[ "$UBUNTU_VERSION" == "18.04" ]]; then
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*
# instead of
# "$UBUNTU_VERSION" == "18.04"
if [[ "$UBUNTU_VERSION" == "18.04"* ]]; then
cmake3="cmake=3.10*"
else
cmake3="${cmake3}=3.5*"
cmake3="cmake=3.5*"
fi
# Install common dependencies
@ -51,14 +50,15 @@ apt-get install -y --no-install-recommends \
# Install Valgrind separately since the apt-get version is too old.
mkdir valgrind_build && cd valgrind_build
if ! wget http://valgrind.org/downloads/valgrind-3.14.0.tar.bz2
VALGRIND_VERSION=3.15.0
if ! wget http://valgrind.org/downloads/valgrind-${VALGRIND_VERSION}.tar.bz2
then
wget https://sourceware.org/ftp/valgrind/valgrind-3.14.0.tar.bz2
wget https://sourceware.org/ftp/valgrind/valgrind-${VALGRIND_VERSION}.tar.bz2
fi
tar -xjf valgrind-3.14.0.tar.bz2
cd valgrind-3.14.0
tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2
cd valgrind-${VALGRIND_VERSION}
./configure --prefix=/usr/local
make
make -j 4
sudo make install
cd ../../
rm -rf valgrind_build


@ -8,7 +8,7 @@ sed -e 's|PATH="\(.*\)"|PATH="/opt/cache/bin:\1"|g' -i /etc/environment
export PATH="/opt/cache/bin:$PATH"
# Setup compiler cache
curl https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache
chmod a+x /opt/cache/bin/sccache
function write_sccache_stub() {


@ -10,7 +10,7 @@ file="cmake-${CMAKE_VERSION}-Linux-x86_64.tar.gz"
# Download and install specific CMake version in /usr/local
pushd /tmp
curl -Os "https://cmake.org/files/${path}/${file}"
curl -Os --retry 3 "https://cmake.org/files/${path}/${file}"
tar -C /usr/local --strip-components 1 --no-same-owner -zxf cmake-*.tar.gz
rm -f cmake-*.tar.gz
popd


@ -4,7 +4,7 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.continuum.io/miniconda"
BASE_URL="https://repo.anaconda.com/miniconda"
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
@ -64,19 +64,21 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
# DO NOT install cmake here as it would install a version newer than 3.5, but
# we want to pin to version 3.5.
conda_install numpy pyyaml mkl mkl-include setuptools cffi typing future six
if [[ "$CUDA_VERSION" == 8.0* ]]; then
conda_install magma-cuda80 -c pytorch
elif [[ "$CUDA_VERSION" == 9.0* ]]; then
conda_install magma-cuda90 -c pytorch
elif [[ "$CUDA_VERSION" == 9.1* ]]; then
conda_install magma-cuda91 -c pytorch
elif [[ "$CUDA_VERSION" == 9.2* ]]; then
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
# DO NOT install typing if installing python-3.8, since it's part of python-3.8 core packages
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0
else
conda_install numpy pyyaml mkl mkl-include setuptools cffi typing future six
fi
if [[ "$CUDA_VERSION" == 9.2* ]]; then
conda_install magma-cuda92 -c pytorch
elif [[ "$CUDA_VERSION" == 10.0* ]]; then
conda_install magma-cuda100 -c pytorch
elif [[ "$CUDA_VERSION" == 10.1* ]]; then
conda_install magma-cuda101 -c pytorch
elif [[ "$CUDA_VERSION" == 10.2* ]]; then
conda_install magma-cuda102 -c pytorch
fi
# TODO: This isn't working atm
@ -88,7 +90,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# scikit-learn is pinned because of
# https://github.com/scikit-learn/scikit-learn/issues/14485 (affects gcc 5.5
# only)
as_jenkins pip install --progress-bar off pytest scipy==1.1.0 scikit-learn==0.20.3 scikit-image librosa>=0.6.2 psutil numba==0.43.1 llvmlite==0.28.0
as_jenkins pip install --progress-bar off pytest scipy==1.1.0 scikit-learn==0.20.3 scikit-image librosa>=0.6.2 psutil numba==0.46.0 llvmlite==0.30.0
popd
fi


@ -7,7 +7,11 @@ if [ -n "$GCC_VERSION" ]; then
# Need the official toolchain repo to get alternate packages
add-apt-repository ppa:ubuntu-toolchain-r/test
apt-get update
apt-get install -y g++-$GCC_VERSION
if [ "$UBUNTU_VERSION" = "16.04" -a "$GCC_VERSION" = "5" ]; then
apt-get install -y g++-5=5.4.0-6ubuntu1~16.04.12
else
apt-get install -y g++-$GCC_VERSION
fi
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-"$GCC_VERSION" 50


@ -0,0 +1,30 @@
#!/bin/bash
set -ex
llvm_url="https://github.com/llvm/llvm-project/releases/download/llvmorg-9.0.1/llvm-9.0.1.src.tar.xz"
mkdir /opt/llvm
pushd /tmp
wget --no-verbose --output-document=llvm.tar.xz "$llvm_url"
mkdir llvm
tar -xf llvm.tar.xz -C llvm --strip-components 1
rm -f llvm.tar.xz
cd llvm
mkdir build
cd build
cmake -G "Unix Makefiles" \
-DCMAKE_BUILD_TYPE=MinSizeRel \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DCMAKE_INSTALL_PREFIX=/opt/llvm \
-DLLVM_TARGETS_TO_BUILD="host" \
-DLLVM_BUILD_TOOLS=OFF \
-DLLVM_BUILD_UTILS=OFF \
-DLLVM_TEMPORARILY_ALLOW_OLD_TOOLCHAIN=ON \
../
make -j4
sudo make install
popd


@ -0,0 +1,89 @@
#!/bin/bash
set -ex
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 18.04 ]]; then
# gpg-agent is not available by default on 18.04
apt-get install -y --no-install-recommends gpg-agent
fi
apt-get install -y wget
apt-get install -y libopenblas-dev
# Need the libc++1 and libc++abi1 libraries to allow torch._C to load at runtime
apt-get install -y libc++1
apt-get install -y libc++abi1
DEB_ROCM_REPO=http://repo.radeon.com/rocm/apt/${ROCM_VERSION}
# Add rocm repository
wget -qO - $DEB_ROCM_REPO/rocm.gpg.key | apt-key add -
echo "deb [arch=amd64] $DEB_ROCM_REPO xenial main" > /etc/apt/sources.list.d/rocm.list
apt-get update --allow-insecure-repositories
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
rocm-dev \
rocm-utils \
rocfft \
miopen-hip \
rocblas \
hipsparse \
rocrand \
hipcub \
rocthrust \
rccl \
rocprofiler-dev \
roctracer-dev
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
install_centos() {
yum update -y
yum install -y wget
yum install -y openblas-devel
yum install -y epel-release
yum install -y dkms kernel-headers-`uname -r` kernel-devel-`uname -r`
echo "[ROCm]" > /etc/yum.repos.d/rocm.repo
echo "name=ROCm" >> /etc/yum.repos.d/rocm.repo
echo "baseurl=http://repo.radeon.com/rocm/yum/rpm/" >> /etc/yum.repos.d/rocm.repo
echo "enabled=1" >> /etc/yum.repos.d/rocm.repo
echo "gpgcheck=0" >> /etc/yum.repos.d/rocm.repo
yum update -y
yum install -y \
rocm-dev \
rocm-utils \
rocfft \
miopen-hip \
rocblas \
hipsparse \
rocrand \
rccl \
hipcub \
rocthrust \
rocprofiler-dev \
roctracer-dev
# Cleanup
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# Install ROCm packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi


@ -14,7 +14,7 @@ if [ -n "$TRAVIS_PYTHON_VERSION" ]; then
# Download Python binary from Travis
pushd tmp
as_jenkins wget --quiet https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64/python-$TRAVIS_PYTHON_VERSION.tar.bz2
as_jenkins wget --quiet ${TRAVIS_DL_URL_PREFIX}/python-$TRAVIS_PYTHON_VERSION.tar.bz2
# NB: The tarball also comes with /home/travis virtualenv that we
# don't care about. (Maybe we should, but we've worked around the
# "how do I install to python" issue by making this entire directory
@ -88,6 +88,9 @@ if [ -n "$TRAVIS_PYTHON_VERSION" ]; then
# Install psutil for dataloader tests
as_jenkins pip install psutil
# Install dill for serialization tests
as_jenkins pip install "dill>=0.3.1"
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*


@ -35,6 +35,11 @@ ARG GCC_VERSION
ADD ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install clang
ARG CLANG_VERSION
ADD ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install non-standard Python versions (via Travis binaries)
ARG TRAVIS_PYTHON_VERSION
ENV PATH /opt/python/$TRAVIS_PYTHON_VERSION/bin:$PATH
@ -81,5 +86,9 @@ ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
# Install LLVM dev version
ADD ./common/install_llvm.sh install_llvm.sh
RUN bash ./install_llvm.sh
USER jenkins
CMD ["bash"]

View File

@ -0,0 +1 @@
*.sh

View File

@ -0,0 +1,86 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
ARG UBUNTU_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately)
ARG EC2
ADD ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install clang
ARG LLVMDEV
ARG CLANG_VERSION
ADD ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install user
ADD ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
ADD ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
ADD ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
ADD ./common/install_vision.sh install_vision.sh
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
ENV INSTALLED_VISION ${VISION}
# Install rocm
ARG ROCM_VERSION
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
ENV PATH /opt/rocm/opencl/bin:$PATH
ENV HIP_PLATFORM hcc
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
ADD ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
ADD ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
ADD ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
USER jenkins
CMD ["bash"]

View File

@ -46,6 +46,7 @@ RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install non-standard Python versions (via Travis binaries)
ARG TRAVIS_PYTHON_VERSION
ARG TRAVIS_DL_URL_PREFIX
ENV PATH /opt/python/$TRAVIS_PYTHON_VERSION/bin:$PATH
ADD ./common/install_travis_python.sh install_travis_python.sh
RUN bash ./install_travis_python.sh && rm install_travis_python.sh
@ -110,5 +111,9 @@ RUN bash ./install_jni.sh && rm install_jni.sh
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version
ADD ./common/install_llvm.sh install_llvm.sh
RUN bash ./install_llvm.sh
USER jenkins
CMD ["bash"]

View File

@ -0,0 +1,13 @@
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y python-pip git && rm -rf /var/lib/apt/lists/* /var/log/dpkg.log
ADD requirements.txt /requirements.txt
RUN pip install -r /requirements.txt
ADD gc.py /usr/bin/gc.py
ADD docker_hub.py /usr/bin/docker_hub.py
ENTRYPOINT ["/usr/bin/gc.py"]

View File

@ -0,0 +1,125 @@
#!/usr/bin/env python
from collections import namedtuple
import boto3
import requests
import os
IMAGE_INFO = namedtuple(
"IMAGE_INFO", ("repo", "tag", "size", "last_updated_at", "last_updated_by")
)
def build_access_token(username, password):
r = requests.post(
"https://hub.docker.com/v2/users/login/",
data={"username": username, "password": password},
)
r.raise_for_status()
token = r.json().get("token")
return {"Authorization": "JWT " + token}
def list_repos(user, token):
r = requests.get("https://hub.docker.com/v2/repositories/" + user, headers=token)
r.raise_for_status()
ret = sorted(
repo["user"] + "/" + repo["name"] for repo in r.json().get("results", [])
)
if ret:
print("repos found:")
print("".join("\n\t" + r for r in ret))
return ret
def list_tags(repo, token):
r = requests.get(
"https://hub.docker.com/v2/repositories/" + repo + "/tags", headers=token
)
r.raise_for_status()
return [
IMAGE_INFO(
repo=repo,
tag=t["name"],
size=t["full_size"],
last_updated_at=t["last_updated"],
last_updated_by=t["last_updater_username"],
)
for t in r.json().get("results", [])
]
def save_to_s3(tags):
table_content = ""
client = boto3.client("s3")
for t in tags:
table_content += (
"<tr><td>{repo}</td><td>{tag}</td><td>{size}</td>"
"<td>{last_updated_at}</td><td>{last_updated_by}</td></tr>"
).format(
repo=t.repo,
tag=t.tag,
size=t.size,
last_updated_at=t.last_updated_at,
last_updated_by=t.last_updated_by,
)
html_body = """
<html>
<head>
<link rel="stylesheet"
href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"
integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh"
crossorigin="anonymous">
<link rel="stylesheet" type="text/css"
href="https://cdn.datatables.net/1.10.20/css/jquery.dataTables.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js">
</script>
<script type="text/javascript" charset="utf8"
src="https://cdn.datatables.net/1.10.20/js/jquery.dataTables.js"></script>
<title> docker image info</title>
</head>
<body>
<table class="table table-striped table-hover" id="docker">
<caption>Docker images on docker hub</caption>
<thead class="thead-dark">
<tr>
<th scope="col">repo</th>
<th scope="col">tag</th>
<th scope="col">size</th>
<th scope="col">last_updated_at</th>
<th scope="col">last_updated_by</th>
</tr>
</thead>
<tbody>
{table_content}
</tbody>
</table>
</body>
<script>
$(document).ready( function () {{
$('#docker').DataTable({{paging: false}});
}} );
</script>
</html>
""".format(
table_content=table_content
)
client.put_object(
Bucket="docker.pytorch.org",
ACL="public-read",
Key="docker_hub.html",
Body=html_body,
ContentType="text/html",
)
if __name__ == "__main__":
username = os.environ.get("DOCKER_HUB_USERNAME")
password = os.environ.get("DOCKER_HUB_PASSWORD")
token = build_access_token(username, password)
tags = []
for repo in list_repos("pytorch", token):
tags.extend(list_tags(repo, token))
save_to_s3(tags)
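
Aside: the doubled braces ({{ and }}) in the embedded JavaScript above are how str.format escapes literal braces, so only single-brace fields such as {table_content} get substituted. A minimal illustration of the mechanism:

template = "$('#docker').DataTable({{paging: false}}); project={name}"
assert template.format(name="pytorch") == "$('#docker').DataTable({paging: false}); project=pytorch"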

.circleci/ecr_gc_docker/gc.py Executable file
View File

@ -0,0 +1,214 @@
#!/usr/bin/env python
import argparse
import datetime
import boto3
import pytz
import sys
import re
def save_to_s3(project, data):
table_content = ""
client = boto3.client("s3")
for repo, tag, window, age, pushed in data:
table_content += "<tr><td>{repo}</td><td>{tag}</td><td>{window}</td><td>{age}</td><td>{pushed}</td></tr>".format(
repo=repo, tag=tag, window=window, age=age, pushed=pushed
)
html_body = """
<html>
<head>
<link rel="stylesheet"
href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"
integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh"
crossorigin="anonymous">
<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.10.20/css/jquery.dataTables.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script type="text/javascript" charset="utf8" src="https://cdn.datatables.net/1.10.20/js/jquery.dataTables.js"></script>
<title>{project} nightly and permanent docker image info</title>
</head>
<body>
<table class="table table-striped table-hover" id="docker">
<thead class="thead-dark">
<tr>
<th scope="col">repo</th>
<th scope="col">tag</th>
<th scope="col">keep window</th>
<th scope="col">age</th>
<th scope="col">pushed at</th>
</tr>
</thead>
<tbody>
{table_content}
</tbody>
</table>
</body>
<script>
$(document).ready( function () {{
$('#docker').DataTable({{paging: false}});
}} );
</script>
</html>
""".format(
project=project, table_content=table_content
)
# for pytorch, the file can be found at
# http://ossci-docker.s3-website.us-east-1.amazonaws.com/pytorch.html
# and later on we can configure docker.pytorch.org to point to that location
client.put_object(
Bucket="docker.pytorch.org",
ACL="public-read",
Key="{project}.html".format(project=project),
Body=html_body,
ContentType="text/html",
)
def repos(client):
paginator = client.get_paginator("describe_repositories")
pages = paginator.paginate(registryId="308535385114")
for page in pages:
for repo in page["repositories"]:
yield repo
def images(client, repository):
paginator = client.get_paginator("describe_images")
pages = paginator.paginate(
registryId="308535385114", repositoryName=repository["repositoryName"]
)
for page in pages:
for image in page["imageDetails"]:
yield image
parser = argparse.ArgumentParser(description="Delete old Docker tags from registry")
parser.add_argument(
"--dry-run", action="store_true", help="Dry run; print tags that would be deleted"
)
parser.add_argument(
"--keep-stable-days",
type=int,
default=14,
help="Days of stable Docker tags to keep (non per-build images)",
)
parser.add_argument(
"--keep-unstable-days",
type=int,
default=1,
help="Days of unstable Docker tags to keep (per-build images)",
)
parser.add_argument(
"--filter-prefix",
type=str,
default="",
help="Only run cleanup for repositories with this prefix",
)
parser.add_argument(
"--ignore-tags",
type=str,
default="",
help="Never cleanup these tags (comma separated)",
)
args = parser.parse_args()
if not args.ignore_tags or not args.filter_prefix:
print(
"""
Missing required arguments --ignore-tags and --filter-prefix
You must specify --ignore-tags and --filter-prefix to avoid accidentally
pruning a stable Docker tag which is being actively used. This will
make you VERY SAD. So pay attention.
First, which filter-prefix do you want? The list of valid prefixes
is in jobs/private.groovy under the 'docker-registry-cleanup' job.
You probably want either pytorch or caffe2.
Second, which ignore-tags do you want? It should be whatever the most
up-to-date DockerVersion for the repository in question is. Follow
the imports of jobs/pytorch.groovy to find them.
"""
)
sys.exit(1)
client = boto3.client("ecr", region_name="us-east-1")
stable_window = datetime.timedelta(days=args.keep_stable_days)
unstable_window = datetime.timedelta(days=args.keep_unstable_days)
now = datetime.datetime.now(pytz.UTC)
ignore_tags = args.ignore_tags.split(",")
def chunks(chunkable, n):
""" Yield successive n-sized chunks from l.
"""
for i in range(0, len(chunkable), n):
yield chunkable[i : i + n]
SHA_PATTERN = re.compile(r'^[0-9a-f]{40}$')
def looks_like_git_sha(tag):
"""Returns a boolean to check if a tag looks like a git sha
For reference a sha1 is 40 characters with only 0-9a-f and contains no
"-" characters
"""
return re.match(SHA_PATTERN, tag) is not None
stable_window_tags = []
for repo in repos(client):
repositoryName = repo["repositoryName"]
if not repositoryName.startswith(args.filter_prefix):
continue
# Keep list of image digests to delete for this repository
digest_to_delete = []
print(repositoryName)
for image in images(client, repo):
tags = image.get("imageTags")
if not isinstance(tags, (list,)) or len(tags) == 0:
continue
tag = tags[0]
created = image["imagePushedAt"]
age = now - created
if any([
looks_like_git_sha(tag),
tag.isdigit(),
tag.count("-") == 4, # TODO: Remove, this no longer applies as tags are now built using a SHA1
tag in ignore_tags]):
window = stable_window
if tag in ignore_tags:
stable_window_tags.append((repositoryName, tag, "", age, created))
elif age < window:
stable_window_tags.append((repositoryName, tag, window, age, created))
else:
window = unstable_window
if tag in ignore_tags:
print("Ignoring tag {}:{} (age: {})".format(repositoryName, tag, age))
continue
if age < window:
print("Not deleting manifest for tag {}:{} (age: {})".format(repositoryName, tag, age))
continue
if args.dry_run:
print("(dry run) Deleting manifest for tag {}:{} (age: {})".format(repositoryName, tag, age))
else:
print("Deleting manifest for tag{}:{} (age: {})".format(repositoryName, tag, age))
digest_to_delete.append(image["imageDigest"])
# Issue batch delete for all images to delete for this repository
# Note that as of 2018-07-25, the maximum number of images you can
# delete in a single batch is 100, so chunk our list into batches of
# 100
for c in chunks(digest_to_delete, 100):
client.batch_delete_image(
registryId="308535385114",
repositoryName=repositoryName,
imageIds=[{"imageDigest": digest} for digest in c],
)
save_to_s3(args.filter_prefix, stable_window_tags)
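
Both small helpers above are easy to sanity-check in isolation; a standalone sketch (not part of the script):

import re

SHA_PATTERN = re.compile(r'^[0-9a-f]{40}$')

def chunks(chunkable, n):
    # Same chunking as in gc.py above: successive n-sized slices
    for i in range(0, len(chunkable), n):
        yield chunkable[i:i + n]

assert [len(c) for c in chunks(list(range(250)), 100)] == [100, 100, 50]
assert SHA_PATTERN.match("0" * 40) is not None   # looks like a git SHA
assert SHA_PATTERN.match("v1.6.0") is None       # an ordinary tag does not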

View File

@ -0,0 +1,3 @@
boto3
pytz
requests

View File

@ -6,13 +6,24 @@ Please see README.md in this directory for details.
"""
import os
import sys
import shutil
from collections import namedtuple, OrderedDict
import sys
from collections import OrderedDict, namedtuple
import cimodel.data.pytorch_build_definitions as pytorch_build_definitions
import cimodel.data.binary_build_definitions as binary_build_definitions
import cimodel.data.caffe2_build_definitions as caffe2_build_definitions
import cimodel.data.pytorch_build_definitions as pytorch_build_definitions
import cimodel.data.simple.android_definitions
import cimodel.data.simple.bazel_definitions
import cimodel.data.simple.binary_smoketest
import cimodel.data.simple.docker_definitions
import cimodel.data.simple.ge_config_tests
import cimodel.data.simple.ios_definitions
import cimodel.data.simple.macos_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_android
import cimodel.data.simple.nightly_ios
import cimodel.data.windows_build_definitions as windows_build_definitions
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
@ -21,6 +32,7 @@ class File(object):
"""
Verbatim copy the contents of a file into config.yml
"""
def __init__(self, filename):
self.filename = filename
@ -29,7 +41,7 @@ class File(object):
shutil.copyfileobj(fh, output_filehandle)
class FunctionGen(namedtuple('FunctionGen', 'function depth')):
class FunctionGen(namedtuple("FunctionGen", "function depth")):
__slots__ = ()
@ -39,15 +51,14 @@ class Treegen(FunctionGen):
"""
def write(self, output_filehandle):
build_dict = OrderedDict()
self.function(build_dict)
miniyaml.render(output_filehandle, build_dict, self.depth)
miniyaml.render(output_filehandle, self.function(), self.depth)
class Listgen(FunctionGen):
"""
Insert the content of a YAML list into config.yml
"""
def write(self, output_filehandle):
miniyaml.render(output_filehandle, self.function(), self.depth)
@ -57,7 +68,6 @@ def horizontal_rule():
class Header(object):
def __init__(self, title, summary=None):
self.title = title
self.summary_lines = summary or []
@ -71,43 +81,81 @@ class Header(object):
output_filehandle.write(line + "\n")
def gen_build_workflows_tree():
build_workflows_functions = [
pytorch_build_definitions.get_workflow_jobs,
cimodel.data.simple.macos_definitions.get_workflow_jobs,
cimodel.data.simple.android_definitions.get_workflow_jobs,
cimodel.data.simple.ios_definitions.get_workflow_jobs,
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.ge_config_tests.get_workflow_jobs,
cimodel.data.simple.bazel_definitions.get_workflow_jobs,
caffe2_build_definitions.get_workflow_jobs,
cimodel.data.simple.binary_smoketest.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
cimodel.data.simple.nightly_android.get_workflow_jobs,
windows_build_definitions.get_windows_workflows,
]
binary_build_functions = [
binary_build_definitions.get_binary_build_jobs,
binary_build_definitions.get_nightly_tests,
binary_build_definitions.get_nightly_uploads,
binary_build_definitions.get_post_upload_jobs,
binary_build_definitions.get_binary_smoke_test_jobs,
]
docker_builder_functions = [
cimodel.data.simple.docker_definitions.get_workflow_jobs
]
return {
"workflows": {
"binary_builds": {
"when": r"<< pipeline.parameters.run_binary_tests >>",
"jobs": [f() for f in binary_build_functions],
},
"docker_build": OrderedDict(
{
"triggers": [
{
"schedule": {
"cron": miniutils.quote("0 15 * * 0"),
"filters": {"branches": {"only": ["master"]}},
}
}
],
"jobs": [f() for f in docker_builder_functions],
}
),
"build": {"jobs": [f() for f in build_workflows_functions]},
}
}
# Order of this list matters to the generated config.yml.
YAML_SOURCES = [
File("header-section.yml"),
File("commands.yml"),
File("nightly-binary-build-defaults.yml"),
Header("Build parameters"),
File("pytorch-build-params.yml"),
File("caffe2-build-params.yml"),
File("binary-build-params.yml"),
File("build-parameters/pytorch-build-params.yml"),
File("build-parameters/caffe2-build-params.yml"),
File("build-parameters/binary-build-params.yml"),
File("build-parameters/promote-build-params.yml"),
Header("Job specs"),
File("pytorch-job-specs.yml"),
File("caffe2-job-specs.yml"),
File("binary-job-specs.yml"),
File("job-specs-setup.yml"),
File("job-specs-custom.yml"),
File("binary_update_htmls.yml"),
File("binary-build-tests.yml"),
File("docker_build_job.yml"),
File("workflows.yml"),
Listgen(pytorch_build_definitions.get_workflow_jobs, 3),
File("workflows-pytorch-macos-builds.yml"),
File("workflows-pytorch-android-gradle-build.yml"),
File("workflows-pytorch-ios-builds.yml"),
File("workflows-pytorch-mobile-builds.yml"),
File("workflows-pytorch-ge-config-tests.yml"),
Listgen(caffe2_build_definitions.get_workflow_jobs, 3),
File("workflows-binary-builds-smoke-subset.yml"),
Listgen(binary_build_definitions.get_binary_smoke_test_jobs, 3),
Listgen(binary_build_definitions.get_binary_build_jobs, 3),
File("workflows-nightly-ios-binary-builds.yml"),
File("workflows-nightly-android-binary-builds.yml"),
Header("Nightly tests"),
Listgen(binary_build_definitions.get_nightly_tests, 3),
File("workflows-nightly-uploads-header.yml"),
Listgen(binary_build_definitions.get_nightly_uploads, 3),
File("workflows-s3-html.yml"),
File("workflows-docker-builder.yml")
File("job-specs/pytorch-job-specs.yml"),
File("job-specs/caffe2-job-specs.yml"),
File("job-specs/binary-job-specs.yml"),
File("job-specs/job-specs-custom.yml"),
File("job-specs/job-specs-promote.yml"),
File("job-specs/binary_update_htmls.yml"),
File("job-specs/binary-build-tests.yml"),
File("job-specs/docker_jobs.yml"),
Header("Workflows"),
Treegen(gen_build_workflows_tree, 0),
File("workflows/workflows-ecr-gc.yml"),
File("workflows/workflows-promote.yml"),
]

View File

@ -1,9 +1,20 @@
#!/bin/bash
set -eux -o pipefail
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# This step runs on multiple executors with different envfile locations
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
rm -rf /c/w
ln -s "/c/Users/circleci/project" /c/w
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
@ -13,11 +24,17 @@ else
fi
# It is very important that this stays in sync with binary_populate_env.sh
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
if [[ "$OSTYPE" == "msys" ]]; then
# We need to make the paths as short as possible on Windows
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
# Clone the Pytorch branch
git clone https://github.com/pytorch/pytorch.git "$PYTORCH_ROOT"
retry git clone https://github.com/pytorch/pytorch.git "$PYTORCH_ROOT"
pushd "$PYTORCH_ROOT"
if [[ -n "${CIRCLE_PR_NUMBER:-}" ]]; then
# "smoke" binary build on PRs
@ -33,13 +50,13 @@ else
echo "Can't tell what to checkout"
exit 1
fi
git submodule update --init --recursive --quiet
retry git submodule update --init --recursive
echo "Using Pytorch from "
git --no-pager log --max-count 1
popd
# Clone the Builder master repo
git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
pushd "$BUILDER_ROOT"
echo "Using builder from "
git --no-pager log --max-count 1
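
The retry helper at the top of this script is a simple exponential backoff: five attempts with 1, 2, 4, and 8 second pauses in between. A rough Python equivalent, for illustration only:

import subprocess
import time

def retry(cmd, delays=(1, 2, 4, 8)):
    # Mirrors: $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
    for delay in delays:
        if subprocess.call(cmd) == 0:
            return
        time.sleep(delay)
    subprocess.check_call(cmd)  # final attempt; raises if it still fails

# e.g. retry(["git", "clone", "https://github.com/pytorch/builder.git"])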

View File

@ -31,9 +31,9 @@ fi
conda_sh="$workdir/install_miniconda.sh"
if [[ "$(uname)" == Darwin ]]; then
retry curl -o "$conda_sh" https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
curl --retry 3 -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
else
retry curl -o "$conda_sh" https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
curl --retry 3 -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
fi
chmod +x "$conda_sh"
"$conda_sh" -b -p "$MINICONDA_ROOT"

View File

@ -5,20 +5,24 @@ echo ""
echo "DIR: $(pwd)"
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
export TCLLIBPATH="/usr/local/lib"
export TCLLIBPATH="/usr/local/lib"
# Install conda
curl -o ~/Downloads/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x ~/Downloads/conda.sh
/bin/bash ~/Downloads/conda.sh -b -p ~/anaconda
curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x ~/conda.sh
/bin/bash ~/conda.sh -b -p ~/anaconda
export PATH="~/anaconda/bin:${PATH}"
source ~/anaconda/bin/activate
# Install dependencies
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing requests --yes
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# sync submodules
cd ${PROJ_ROOT}
git submodule sync
git submodule update --init --recursive
# run build script
chmod a+x ${PROJ_ROOT}/scripts/build_ios.sh
echo "########################################################"
@ -26,13 +30,13 @@ cat ${PROJ_ROOT}/scripts/build_ios.sh
echo "########################################################"
echo "IOS_ARCH: ${IOS_ARCH}"
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
export BUILD_PYTORCH_MOBILE=1
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
#store the binary
cd ${WORKSPACE}
DEST_DIR=${WORKSPACE}/ios
mkdir -p ${DEST_DIR}
cp -R ${PROJ_ROOT}/build_ios/install ${DEST_DIR}
mv ${DEST_DIR}/install ${DEST_DIR}/${IOS_ARCH}
mv ${DEST_DIR}/install ${DEST_DIR}/${IOS_ARCH}

View File

@ -14,14 +14,14 @@ mkdir -p ${ZIP_DIR}/src
cp -R ${ARTIFACTS_DIR}/arm64/include ${ZIP_DIR}/install/
# build a FAT binary
cd ${ZIP_DIR}/install/lib
target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpytorch_qnnpack.a libtorch.a)
target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpthreadpool.a libpytorch_qnnpack.a libtorch_cpu.a libtorch.a libXNNPACK.a)
for lib in ${target_libs[*]}
do
libs=(${ARTIFACTS_DIR}/x86_64/lib/${lib} ${ARTIFACTS_DIR}/arm64/lib/${lib})
lipo -create "${libs[@]}" -o ${ZIP_DIR}/install/lib/${lib}
if [ -f "${ARTIFACTS_DIR}/x86_64/lib/${lib}" ] && [ -f "${ARTIFACTS_DIR}/arm64/lib/${lib}" ]; then
libs=("${ARTIFACTS_DIR}/x86_64/lib/${lib}" "${ARTIFACTS_DIR}/arm64/lib/${lib}")
lipo -create "${libs[@]}" -o ${ZIP_DIR}/install/lib/${lib}
fi
done
# for nnpack, we only support arm64 build
cp ${ARTIFACTS_DIR}/arm64/lib/libnnpack.a ./
lipo -i ${ZIP_DIR}/install/lib/*.a
# copy the umbrella header and license
cp ${PROJ_ROOT}/ios/LibTorch.h ${ZIP_DIR}/src/

View File

@ -9,13 +9,15 @@ set -eux -o pipefail
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda create -qyn testenv python="$DESIRED_PYTHON"
source activate testenv >/dev/null
elif [[ "$DESIRED_PYTHON" == 2.7mu ]]; then
export PATH="/opt/python/cp27-cp27mu/bin:\$PATH"
elif [[ "$DESIRED_PYTHON" == 3.8m ]]; then
export PATH="/opt/python/cp38-cp38/bin:\$PATH"
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
python_nodot="\$(echo $DESIRED_PYTHON | tr -d m.u)"
export PATH="/opt/python/cp\$python_nodot-cp\${python_nodot}m/bin:\$PATH"
python_path="/opt/python/cp\$python_nodot-cp\${python_nodot}"
# Prior to Python 3.8 paths were suffixed with an 'm'
if [[ -d "\${python_path}/bin" ]]; then
export PATH="\${python_path}/bin:\$PATH"
elif [[ -d "\${python_path}m/bin" ]]; then
export PATH="\${python_path}m/bin:\$PATH"
fi
fi
# Install the package
@ -28,11 +30,11 @@ pkg="/final_pkgs/\$(ls /final_pkgs)"
if [[ "$PACKAGE_TYPE" == conda ]]; then
conda install -y "\$pkg" --offline
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
conda install -y cpuonly -c pytorch
retry conda install -y cpuonly -c pytorch
fi
retry conda install -yq future numpy protobuf six
if [[ "$DESIRED_CUDA" != 'cpu' ]]; then
# DESIRED_CUDA is in format cu90 or cu100
# DESIRED_CUDA is in format cu90 or cu102
if [[ "${#DESIRED_CUDA}" == 4 ]]; then
cu_ver="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3}"
else
@ -52,6 +54,7 @@ fi
# Test the package
/builder/check_binary.sh
# =================== The above code will be executed inside Docker container ===================
EOL
echo
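
The substring expansion above turns a four-character tag such as cu92 into 9.2; the else branch (cut off by the diff) presumably does the same for five-character tags like cu102. In Python terms, roughly (the cu102 case is an assumption):

def cu_ver(desired_cuda):
    digits = desired_cuda[2:]            # "cu92" -> "92", "cu102" -> "102"
    return digits[:-1] + "." + digits[-1]

assert cu_ver("cu92") == "9.2"
assert cu_ver("cu102") == "10.2"         # assumed, matching the elided else branch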

View File

@ -5,15 +5,6 @@ set -eu -o pipefail
set +x
declare -x "AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
declare -x "AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
cat >/home/circleci/project/login_to_anaconda.sh <<EOL
set +x
echo "Trying to login to Anaconda"
yes | anaconda login \
--username "$PYTORCH_BINARY_PJH5_CONDA_USERNAME" \
--password "$PYTORCH_BINARY_PJH5_CONDA_PASSWORD"
set -x
EOL
chmod +x /home/circleci/project/login_to_anaconda.sh
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
@ -21,20 +12,37 @@ chmod +x /home/circleci/project/login_to_anaconda.sh
set -eux -o pipefail
export PATH="$MINICONDA_ROOT/bin:$PATH"
# This gets set in binary_populate_env.sh, but let's have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if present
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
retry pip install -q awscli
# Upload the package to the final location
pushd /home/circleci/project/final_pkgs
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry timeout 30 /home/circleci/project/login_to_anaconda.sh
anaconda upload "$(ls)" -u pytorch-nightly --label main --no-progress --force
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (e.g. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp --recursive . "$s3_dir"
fi
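
The tar/grep/cut/sed pipeline above scrapes the subdir field (e.g. linux-64) out of the conda package's info/index.json. For comparison, a Python sketch that parses the JSON directly (assuming a .tar.bz2 conda package on disk):

import json
import tarfile

def conda_subdir(pkg_path):
    # Equivalent of: tar -xOf pkg info/index.json | grep subdir | cut/sed
    with tarfile.open(pkg_path, "r:bz2") as tf:
        index = json.load(tf.extractfile("info/index.json"))
    return index["subdir"]  # e.g. "linux-64" or "win-64"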

View File

@ -4,15 +4,6 @@ set -eu -o pipefail
set +x
export AWS_ACCESS_KEY_ID="${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
cat >/Users/distiller/project/login_to_anaconda.sh <<EOL
set +x
echo "Trying to login to Anaconda"
yes | anaconda login \
--username "$PYTORCH_BINARY_PJH5_CONDA_USERNAME" \
--password "$PYTORCH_BINARY_PJH5_CONDA_PASSWORD"
set -x
EOL
chmod +x /Users/distiller/project/login_to_anaconda.sh
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
@ -22,19 +13,36 @@ set -eux -o pipefail
source "/Users/distiller/project/env"
export "PATH=$workdir/miniconda/bin:$PATH"
# This gets set in binary_populate_env.sh, but let's have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if present
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
retry pip install -q awscli
pushd "$workdir/final_pkgs"
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry /Users/distiller/project/login_to_anaconda.sh
retry anaconda upload "$(ls)" -u pytorch-nightly --label main --no-progress --force
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (e.g. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp --recursive . "$s3_dir"
fi

View File

@ -2,11 +2,31 @@
set -eux -o pipefail
export TZ=UTC
tagged_version() {
# Grabs version from either the env variable CIRCLE_TAG
# or the pytorch git described version
if [[ "$OSTYPE" == "msys" ]]; then
GIT_DESCRIBE="git --git-dir ${workdir}/p/.git describe"
else
GIT_DESCRIBE="git --git-dir ${workdir}/pytorch/.git describe"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
echo "${CIRCLE_TAG}"
elif ${GIT_DESCRIBE} --exact --tags >/dev/null; then
${GIT_DESCRIBE} --tags
else
return 1
fi
}
# We need to write an envfile to persist these variables to the following
# steps, but the location of the envfile depends on the circleci executor
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
@ -23,7 +43,15 @@ configs=($BUILD_ENVIRONMENT)
export PACKAGE_TYPE="${configs[0]}"
export DESIRED_PYTHON="${configs[1]}"
export DESIRED_CUDA="${configs[2]}"
export DESIRED_DEVTOOLSET="${configs[3]:-}"
if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
export DESIRED_DEVTOOLSET=""
export LIBTORCH_CONFIG="${configs[3]:-}"
if [[ "$LIBTORCH_CONFIG" == 'debug' ]]; then
export DEBUG=1
fi
else
export DESIRED_DEVTOOLSET="${configs[3]:-}"
fi
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
export BUILD_PYTHONLESS=1
fi
@ -40,25 +68,27 @@ if [[ -z "$DOCKER_IMAGE" ]]; then
fi
fi
# Upload to parallel folder for devtoolsets
# All nightlies used to be devtoolset3, then devtoolset7 was added as a build
# option, so the upload was redirected to nightly/devtoolset7 to avoid
# conflicts with other binaries (there shouldn't be any conflicts). Now we are
# making devtoolset7 the default.
if [[ "$DESIRED_DEVTOOLSET" == 'devtoolset7' || "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$(uname)" == 'Darwin' ]]; then
export PIP_UPLOAD_FOLDER='nightly/'
else
# On linux machines, this shouldn't actually be called anymore. This is just
# here for extra safety.
export PIP_UPLOAD_FOLDER='nightly/devtoolset3/'
fi
# Default to nightly, since that's where this normally uploads to
PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
if [[ "$(uname)" == 'Darwin' ]] || [[ "$DESIRED_CUDA" == "cu101" ]] || [[ "$PACKAGE_TYPE" == conda ]]; then
export PYTORCH_BUILD_VERSION="1.4.0.dev$DATE"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="1.6.0.dev$DATE"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
if tagged_version >/dev/null; then
# Switch upload folder to 'test/' if we are on a tag
PIP_UPLOAD_FOLDER='test/'
# Grab git tag, remove prefixed v and remove everything after -
# Used to clean up tags that are for release candidates like v1.6.0-rc1
# Turns tag v1.6.0-rc1 -> v1.6.0
BASE_BUILD_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"
fi
if [[ "$(uname)" == 'Darwin' ]] || [[ "$DESIRED_CUDA" == "cu102" ]] || [[ "$PACKAGE_TYPE" == conda ]]; then
export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}"
else
export PYTORCH_BUILD_VERSION="1.4.0.dev$DATE+$DESIRED_CUDA"
export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}+$DESIRED_CUDA"
fi
export PYTORCH_BUILD_NUMBER=1
@ -94,9 +124,13 @@ export DESIRED_CUDA="$DESIRED_CUDA"
export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"
export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"
export DESIRED_DEVTOOLSET="$DESIRED_DEVTOOLSET"
if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"
export DEBUG="${DEBUG:-}"
fi
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.4.0.dev
export NIGHTLIES_DATE_PREAMBLE=1.6.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
@ -113,8 +147,13 @@ export DOCKER_IMAGE="$DOCKER_IMAGE"
export workdir="$workdir"
export MAC_PACKAGE_WORK_DIR="$workdir"
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
if [[ "$OSTYPE" == "msys" ]]; then
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
export MINICONDA_ROOT="$workdir/miniconda"
export PYTORCH_FINAL_PACKAGE_DIR="$workdir/final_pkgs"
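
The sed invocation in the tagged-version handling above normalizes release-candidate tags (v1.6.0-rc1 becomes 1.6.0) before they feed into PYTORCH_BUILD_VERSION. The same transform, written out in Python as a quick check:

import re

def base_version(tag):
    # Mirrors: sed -e 's/^v//' -e 's/-.*$//'
    return re.sub(r"-.*$", "", re.sub(r"^v", "", tag))

assert base_version("v1.6.0-rc1") == "1.6.0"
assert base_version("v1.6.0") == "1.6.0"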

View File

@ -16,31 +16,12 @@ set -eux -o pipefail
# Expect actual code to be written to this file
chmod +x /home/circleci/project/ci_test_script.sh
VOLUME_MOUNTS="-v /home/circleci/project/:/circleci_stuff -v /home/circleci/project/final_pkgs:/final_pkgs -v ${PYTORCH_ROOT}:/pytorch -v ${BUILDER_ROOT}:/builder"
# Run the docker
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d "${DOCKER_IMAGE}")
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d "${DOCKER_IMAGE}")
fi
# Copy the envfile and script with all the code to run into the docker.
docker cp /home/circleci/project/. "$id:/circleci_stuff"
# Copy built packages into the docker to test. This should only exist on the
# binary test jobs. The package should've been created from a binary build job,
# which persisted the package to a CircleCI workspace, which this job then
# copies into a GPU enabled docker for testing
if [[ -d "/home/circleci/project/final_pkgs" ]]; then
docker cp /home/circleci/project/final_pkgs "$id:/final_pkgs"
fi
# Copy the needed repos into the docker. These do not exist in the smoke test
# jobs, since the smoke test jobs do not need the Pytorch source code.
if [[ -d "$PYTORCH_ROOT" ]]; then
docker cp "$PYTORCH_ROOT" "$id:/pytorch"
fi
if [[ -d "$BUILDER_ROOT" ]]; then
docker cp "$BUILDER_ROOT" "$id:/builder"
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
fi
# Execute the test script that was populated by an earlier section

View File

@ -0,0 +1,41 @@
#!/bin/bash
set -eux -o pipefail
source "/c/w/env"
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export USE_SCCACHE=1
export SCCACHE_BUCKET=ossci-compiler-cache-windows
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
if [[ "$CUDA_VERSION" == "92" || "$CUDA_VERSION" == "100" ]]; then
export VC_YEAR=2017
else
export VC_YEAR=2019
fi
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
set -x
if [[ "$CIRCLECI" == 'true' && -d "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages\\_Instances" ]]; then
mv "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages\\_Instances" .
rm -rf "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages"
mkdir -p "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages"
mv _Instances "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages"
fi
echo "Free space on filesystem before build:"
df -h
pushd "$BUILDER_ROOT"
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
./windows/internal/build_conda.bat
elif [[ "$PACKAGE_TYPE" == 'wheel' || "$PACKAGE_TYPE" == 'libtorch' ]]; then
./windows/internal/build_wheels.bat
fi
echo "Free space on filesystem after build:"
df -h

View File

@ -0,0 +1,19 @@
#!/bin/bash
set -eux -o pipefail
source "/c/w/env"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2017
if [[ "$CUDA_VERSION" == "92" || "$CUDA_VERSION" == "100" ]]; then
export VC_YEAR=2017
else
export VC_YEAR=2019
fi
pushd "$BUILDER_ROOT"
./windows/internal/smoke_test.bat
popd

View File

@ -0,0 +1,47 @@
#!/bin/bash
set -eu -o pipefail
set +x
declare -x "AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
declare -x "AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
source "/env"
# This gets set in binary_populate_env.sh, but let's have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly/}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if present
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
retry pip install -q awscli
pushd /root/workspace/final_pkgs
# Upload the package to the final location
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (e.g. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp --recursive . "$s3_dir"
fi

View File

@ -57,7 +57,6 @@ time python aten/src/ATen/gen.py \
-s aten/src/ATen \
-d build/aten/src/ATen \
aten/src/ATen/Declarations.cwrap \
aten/src/THNN/generic/THNN.h \
aten/src/THCUNN/generic/THCUNN.h \
aten/src/ATen/nn.yaml \
aten/src/ATen/native/native_functions.yaml
@ -73,10 +72,10 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
pip install breathe==4.11.1 bs4 lxml six
pip install breathe==4.13.0 bs4 lxml six
pip install --no-cache-dir -e "git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme"
pip install exhale>=0.2.1
pip install sphinx==1.8.5
pip install sphinx==2.4.4
# Uncomment once it is fixed
# pip install -r requirements.txt
time make VERBOSE=1 html -j

View File

@ -71,8 +71,30 @@ cp -a ../vision/docs/source source/torchvision
# Build the docs
pip -q install -r requirements.txt || true
if [ "$is_master_doc" = true ]; then
# TODO: fix gh-38011 then enable this which changes warnings into errors
# export SPHINXOPTS="-WT --keep-going"
make html
make coverage
# Now that we have the coverage report, we need to make sure it is empty.
# Count the number of lines in the file and turn that number into a variable
# $lines. The `cut -f1 ...` is to only parse the number, not the filename
# Skip the report header by subtracting 2: the header will be output even if
# there are no undocumented items.
#
# Also: see docs/source/conf.py for "coverage_ignore*" items, which should
# be documented then removed from there.
lines=$(wc -l build/coverage/python.txt 2>/dev/null |cut -f1 -d' ')
undocumented=$(($lines - 2))
if [ $undocumented -lt 0 ]; then
echo coverage output not found
exit 1
elif [ $undocumented -gt 0 ]; then
echo undocumented objects found:
cat build/coverage/python.txt
exit 1
fi
else
# Don't fail the build on coverage problems
make html-stable
fi
@ -90,6 +112,12 @@ else
find "$install_path" -name "*.html" -print0 | xargs -0 perl -pi -w -e "s@master\s+\((\d\.\d\.[A-Fa-f0-9]+\+[A-Fa-f0-9]+)\s+\)@<a href='http://pytorch.org/docs/versions.html'>$version \&#x25BC</a>@g"
fi
# Prevent Google from indexing $install_path/_modules. This folder contains
# generated source files.
# NB: the following only works on gnu sed. The sed shipped with mac os is different.
# One can `brew install gnu-sed` on a mac and then use "gsed" instead of "sed".
find "$install_path/_modules" -name "*.html" -print0 | xargs -0 sed -i '/<head>/a \ \ <meta name="robots" content="noindex">'
git add "$install_path" || true
git status
git config user.email "soumith+bot@pytorch.org"

View File

@ -2,7 +2,7 @@
set -ex -o pipefail
# Set up NVIDIA docker repo
curl -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L --retry 3 https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
echo "deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
echo "deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
echo "deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
@ -13,6 +13,15 @@ sudo rm -f /etc/apt/heroku.list
sudo rm -f /etc/apt/openjdk-r-ubuntu-ppa-xenial.list
sudo rm -f /etc/apt/partner.list
retry () {
$* || $* || $* || $* || $*
}
# Method adapted from here: https://askubuntu.com/questions/875213/apt-get-to-retry-downloading
# (with use of tee to avoid permissions problems)
# This is better than retrying the whole apt-get command
echo "APT::Acquire::Retries \"3\";" | sudo tee /etc/apt/apt.conf.d/80-retries
sudo apt-get -y update
sudo apt-get -y remove linux-image-generic linux-headers-generic linux-generic docker-ce
# WARNING: Docker version is hardcoded here; you must update the
@ -27,7 +36,11 @@ sudo apt-get -y remove linux-image-generic linux-headers-generic linux-generic d
# Ubuntu version (e.g., docker run -it ubuntu:16.04) and then ask
# apt what the packages you need are. Note that the CircleCI image
# comes with Docker.
sudo apt-get -y install \
#
# Using 'retry' here as belt-and-suspenders even though we are
# presumably retrying at the single-package level via the
# apt.conf.d/80-retries technique.
retry sudo apt-get -y install \
linux-headers-$(uname -r) \
linux-image-generic \
moreutils \
@ -38,14 +51,11 @@ sudo apt-get -y install \
sudo pkill -SIGHUP dockerd
retry () {
$* || $* || $* || $* || $*
}
retry sudo pip -q install awscli==1.16.35
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
DRIVER_FN="NVIDIA-Linux-x86_64-430.40.run"
DRIVER_FN="NVIDIA-Linux-x86_64-440.59.run"
wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
nvidia-smi

View File

@ -2,7 +2,7 @@
set -eux -o pipefail
# Set up CircleCI GPG keys for apt, if needed
curl -L https://packagecloud.io/circleci/trusty/gpgkey | sudo apt-key add -
curl --retry 3 -s -L https://packagecloud.io/circleci/trusty/gpgkey | sudo apt-key add -
# Stop background apt updates. Hypothetically, the kill should not
# be necessary, because stop is supposed to send a kill signal to

View File

@ -1,140 +0,0 @@
import argparse
import re
import sys
# Modify this variable if you want to change the set of default jobs
# which are run on all pull requests.
#
# WARNING: Actually, this is a lie; we're currently also controlling
# the set of jobs to run via the Workflows filters in CircleCI config.
default_set = set([
# PyTorch CPU
# Selected oldest Python 2 version to ensure Python 2 coverage
'pytorch-linux-xenial-py2.7.9',
# PyTorch CUDA
'pytorch-linux-xenial-cuda9-cudnn7-py3',
# PyTorch ASAN
'pytorch-linux-xenial-py3-clang5-asan',
# PyTorch DEBUG
'pytorch-linux-xenial-py3.6-gcc5.4',
# LibTorch
'pytorch-libtorch-linux-xenial-cuda9-cudnn7-py3',
# Caffe2 CPU
'caffe2-py2-mkl-ubuntu16.04',
# Caffe2 CUDA
'caffe2-py3.5-cuda10.1-cudnn7-ubuntu16.04',
# Caffe2 ONNX
'caffe2-onnx-py2-gcc5-ubuntu16.04',
'caffe2-onnx-py3.6-clang7-ubuntu16.04',
# Caffe2 Clang
'caffe2-py2-clang7-ubuntu16.04',
# Caffe2 CMake
'caffe2-cmake-cuda9.0-cudnn7-ubuntu16.04',
# Caffe2 CentOS
'caffe2-py3.6-devtoolset7-cuda9.0-cudnn7-centos7',
# Binaries
'manywheel 2.7mu cpu devtoolset7',
'libtorch 2.7m cpu devtoolset7',
'libtorch 2.7m cpu gcc5.4_cxx11-abi',
'libtorch 2.7 cpu',
'libtorch-ios-11.2.1-nightly-x86_64-build',
'libtorch-ios-11.2.1-nightly-arm64-build',
'libtorch-ios-11.2.1-nightly-binary-build-upload',
# Caffe2 Android
'caffe2-py2-android-ubuntu16.04',
# Caffe2 OSX
'caffe2-py2-system-macos10.13',
# PyTorch OSX
'pytorch-macos-10.13-py3',
'pytorch-macos-10.13-cuda9.2-cudnn7-py3',
# PyTorch Android
'pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_32-build',
'pytorch-linux-xenial-py3-clang5-android-ndk-r19',
# PyTorch Android gradle
'pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32',
# Pytorch iOS builds
'pytorch-ios-11.2.1-x86_64_build',
'pytorch-ios-11.2.1-arm64_build',
# PyTorch Mobile builds
'pytorch-linux-xenial-py3-clang5-mobile-build',
# Pytorch backward compatibility check
'pytorch-linux-backward-compatibility-check-test',
# XLA
'pytorch-xla-linux-xenial-py3.6-clang7',
# GraphExecutor config jobs
'pytorch-linux-xenial-py3.6-gcc5.4-ge_config_simple-test',
'pytorch-linux-xenial-py3.6-gcc5.4-ge_config_legacy-test',
# Other checks
'pytorch-short-perf-test-gpu',
'pytorch-python-doc-push',
'pytorch-cpp-doc-push',
])
# Collection of jobs that are *temporarily* excluded from running on PRs.
# Use this if there is a long-running job breakage that we can't fix with a
# single revert.
skip_override = {
# example entry:
# 'pytorch-cpp-doc-push': "https://github.com/pytorch/pytorch/issues/<related issue>"
}
# Takes in commit message to analyze via stdin
#
# This script will query Git and attempt to determine if we should
# run the current CI job under question
#
# NB: Try to avoid hard-coding names here, so there's less place to update when jobs
# are updated/renamed
#
# Semantics in the presence of multiple tags:
# - Let D be the set of default builds
# - Let S be the set of explicitly specified builds
# - Let O be the set of temporarily skipped builds
# - Run S \/ (D - O)
parser = argparse.ArgumentParser()
parser.add_argument('build_environment')
args = parser.parse_args()
commit_msg = sys.stdin.read()
# Matches anything that looks like [foo ci] or [ci foo] or [foo test]
# or [test foo]
RE_MARKER = re.compile(r'\[(?:([^ \[\]]+) )?(?:ci|test)(?: ([^ \[\]]+))?\]')
markers = RE_MARKER.finditer(commit_msg)
for m in markers:
if m.group(1) and m.group(2):
print("Unrecognized marker: {}".format(m.group(0)))
continue
spec = m.group(1) or m.group(2)
if spec is None:
print("Unrecognized marker: {}".format(m.group(0)))
continue
if spec in args.build_environment or spec == 'all':
print("Accepting {} due to commit marker {}".format(args.build_environment, m.group(0)))
sys.exit(0)
skip_override_set = set(skip_override.keys())
should_run_set = default_set - skip_override_set
for spec in should_run_set:
if spec in args.build_environment:
print("Accepting {} as part of default set".format(args.build_environment))
sys.exit(0)
print("Rejecting {}".format(args.build_environment))
for spec, issue in skip_override.items():
if spec in args.build_environment:
print("This job is temporarily excluded from running on PRs. Reason: {}".format(issue))
break
sys.exit(1)
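
For reference, the commit-marker regex in the file deleted above can be exercised on its own (a standalone sketch):

import re

RE_MARKER = re.compile(r'\[(?:([^ \[\]]+) )?(?:ci|test)(?: ([^ \[\]]+))?\]')

m = RE_MARKER.search("Fix the build [xla ci]")
assert m.group(1) == "xla" and m.group(2) is None   # the [foo ci] form
m = RE_MARKER.search("Fix the build [ci xla]")
assert m.group(1) is None and m.group(2) == "xla"   # the [ci foo] form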

View File

@ -1,29 +0,0 @@
#!/usr/bin/env bash
set -exu -o pipefail
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# Check if we should actually run
echo "BUILD_ENVIRONMENT: ${BUILD_ENVIRONMENT:-}"
echo "CIRCLE_PULL_REQUEST: ${CIRCLE_PULL_REQUEST:-}"
if [ -z "${BUILD_ENVIRONMENT:-}" ]; then
echo "Cannot run should_run_job.sh if BUILD_ENVIRONMENT is not defined!"
echo "CircleCI scripts are probably misconfigured."
exit 1
fi
if ! [ -e "$SCRIPT_DIR/COMMIT_MSG" ]; then
echo "Cannot run should_run_job.sh if you don't have COMMIT_MSG"
echo "written out. Are you perhaps running the wrong copy of this script?"
echo "You should be running the copy in ~/workspace; SCRIPT_DIR=$SCRIPT_DIR"
exit 1
fi
if [ -n "${CIRCLE_PULL_REQUEST:-}" ]; then
if [[ $CIRCLE_BRANCH != "ci-all/"* ]] && [[ $CIRCLE_BRANCH != "nightly" ]] && [[ $CIRCLE_BRANCH != "postnightly" ]] ; then
# Don't swallow "script doesn't exist"
[ -e "$SCRIPT_DIR/should_run_job.py" ]
if ! python "$SCRIPT_DIR/should_run_job.py" "${BUILD_ENVIRONMENT:-}" < "$SCRIPT_DIR/COMMIT_MSG" ; then
circleci step halt
exit
fi
fi
fi

View File

@ -0,0 +1,145 @@
import glob
import json
import logging
import os
import os.path
import pathlib
import re
import sys
import time
import zipfile
import requests
def get_size(file_dir):
try:
# we should only expect one file; if not, something is wrong
file_name = glob.glob(os.path.join(file_dir, "*"))[0]
return os.stat(file_name).st_size
except:
logging.exception(f"error getting file from: {file_dir}")
return 0
def build_message(size):
pkg_type, py_ver, cu_ver, *_ = os.environ.get("BUILD_ENVIRONMENT", "").split() + [
None,
None,
None,
]
os_name = os.uname()[0].lower()
if os_name == "darwin":
os_name = "macos"
return {
"normal": {
"os": os_name,
"pkg_type": pkg_type,
"py_ver": py_ver,
"cu_ver": cu_ver,
"pr": os.environ.get("CIRCLE_PR_NUMBER"),
"build_num": os.environ.get("CIRCLE_BUILD_NUM"),
"sha1": os.environ.get("CIRCLE_SHA1"),
"branch": os.environ.get("CIRCLE_BRANCH"),
},
"int": {
"time": int(time.time()),
"size": size,
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
},
}
def send_message(messages):
access_token = os.environ.get("SCRIBE_GRAPHQL_ACCESS_TOKEN")
if not access_token:
raise ValueError("Can't find access token from environment variable")
url = "https://graph.facebook.com/scribe_logs"
r = requests.post(
url,
data={
"access_token": access_token,
"logs": json.dumps(
[
{
"category": "perfpipe_pytorch_binary_size",
"message": json.dumps(message),
"line_escape": False,
}
for message in messages
]
),
},
)
print(r.text)
r.raise_for_status()
def report_android_sizes(file_dir):
def gen_sizes():
# we should only expect one file; if not, something is wrong
aar_files = list(pathlib.Path(file_dir).rglob("pytorch_android-*.aar"))
if len(aar_files) != 1:
logging.exception(f"error getting aar files from: {file_dir} / {aar_files}")
return
aar_file = aar_files[0]
zf = zipfile.ZipFile(aar_file)
for info in zf.infolist():
# Scan ".so" libs in `jni` folder. Examples:
# jni/arm64-v8a/libfbjni.so
# jni/arm64-v8a/libpytorch_jni.so
m = re.match(r"^jni/([^/]+)/(.*\.so)$", info.filename)
if not m:
continue
arch, lib = m.groups()
# report per architecture library size
yield [arch, lib, info.compress_size, info.file_size]
# report whole package size
yield ["aar", aar_file.name, os.stat(aar_file).st_size, 0]
def gen_messages():
android_build_type = os.environ.get("ANDROID_BUILD_TYPE")
for arch, lib, comp_size, uncomp_size in gen_sizes():
print(android_build_type, arch, lib, comp_size, uncomp_size)
yield {
"normal": {
"os": "android",
# TODO: create dedicated columns
"pkg_type": "{}/{}/{}".format(android_build_type, arch, lib),
"cu_ver": "", # dummy value for derived field `build_name`
"py_ver": "", # dummy value for derived field `build_name`
"pr": os.environ.get("CIRCLE_PR_NUMBER"),
"build_num": os.environ.get("CIRCLE_BUILD_NUM"),
"sha1": os.environ.get("CIRCLE_SHA1"),
"branch": os.environ.get("CIRCLE_BRANCH"),
},
"int": {
"time": int(time.time()),
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
"size": comp_size,
"raw_size": uncomp_size,
},
}
send_message(list(gen_messages()))
if __name__ == "__main__":
file_dir = os.environ.get(
"PYTORCH_FINAL_PACKAGE_DIR", "/home/circleci/project/final_pkgs"
)
if len(sys.argv) == 2:
file_dir = sys.argv[1]
print("checking dir: " + file_dir)
if "-android" in os.environ.get("BUILD_ENVIRONMENT", ""):
report_android_sizes(file_dir)
else:
size = get_size(file_dir)
if size != 0:
try:
send_message([build_message(size)])
except:
logging.exception("can't send message")

View File

@ -0,0 +1,34 @@
$VS_DOWNLOAD_LINK = "https://aka.ms/vs/15/release/vs_buildtools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.VisualStudio.Component.VC.Tools.14.11",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",
"--add Microsoft.VisualStudio.Component.VC.CoreIde",
"--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2017 installer failed"
exit 1
}
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "VS 2017 installer exited with code $exitCode, which should be one of [0, 3010]."
curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS Collect tool failed."
exit 1
}
Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru
New-Item -Path "C:\w\build-results" -ItemType "directory" -Force
Copy-Item -Path "C:\Users\circleci\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"
exit 1
}

View File

@ -0,0 +1,37 @@
#!/bin/bash
set -eux -o pipefail
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/cuda_10.1.243_426.00_win10.exe
7z x cuda_10.1.243_426.00_win10.exe -ocuda_10.1.243_426.00_win10
cd cuda_10.1.243_426.00_win10
mkdir cuda_install_logs
set +e
./setup.exe -s nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1 -loglevel:6 -log:"$(pwd -W)/cuda_install_logs"
set -e
if [[ "${VC_YEAR}" == "2017" ]]; then
cp -r CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions/* "C:/Program Files (x86)/Microsoft Visual Studio/2017/${VC_PRODUCT}/Common7/IDE/VC/VCTargets/BuildCustomizations/"
else
cp -r CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions/* "C:/Program Files (x86)/Microsoft Visual Studio/2019/${VC_PRODUCT}/MSBuild/Microsoft/VC/v160/BuildCustomizations/"
fi
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/NvToolsExt.7z
7z x NvToolsExt.7z -oNvToolsExt
mkdir -p "C:/Program Files/NVIDIA Corporation/NvToolsExt"
cp -r NvToolsExt/* "C:/Program Files/NVIDIA Corporation/NvToolsExt/"
export NVTOOLSEXT_PATH="C:\\Program Files\\NVIDIA Corporation\\NvToolsExt\\"
if ! ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/bin/nvcc.exe"
then
echo "CUDA installation failed"
mkdir -p /c/w/build-results
7z a "c:\\w\\build-results\\cuda_install_logs.7z" cuda_install_logs
exit 1
fi
cd ..
rm -rf ./cuda_10.1.243_426.00_win10
rm -f ./cuda_10.1.243_426.00_win10.exe

View File

@ -1,43 +1,44 @@
#!/usr/bin/env python3
import urllib.request
import re
import cimodel.data.pytorch_build_definitions as pytorch_build_definitions
import cimodel.data.caffe2_build_definitions as caffe2_build_definitions
import cimodel.data.simple.util.docker_constants as pytorch_docker_constants
RE_VERSION = re.compile(r'allDeployedVersions = "([0-9,]+)"')
from yaml import load
URL_TEMPLATE = (
"https://raw.githubusercontent.com/pytorch/ossci-job-dsl/"
"master/src/main/groovy/ossci/{}/DockerVersion.groovy"
)
try:
from yaml import CLoader as Loader
except ImportError:
from yaml import Loader
def load_config(filename=".circleci/config.yml"):
with open(filename, "r") as fh:
return load("".join(fh.readlines()), Loader)
def load_tags_for_projects(workflow_config):
return {
v["ecr_gc_job"]["project"]: v["ecr_gc_job"]["tags_to_keep"]
for v in workflow_config["workflows"]["ecr_gc"]["jobs"]
if isinstance(v, dict) and "ecr_gc_job" in v
}
def check_version(job, tags, expected_version):
valid_versions = tags[job].split(",")
if expected_version not in valid_versions:
raise RuntimeError(
"We configured {} to use Docker version {}; but this "
"version is not configured in job ecr_gc_job_for_{}. Non-deployed versions will be "
"garbage collected two weeks after they are created. DO NOT LAND "
"THIS TO MASTER without also updating ossci-job-dsl with this version."
"\n\nDeployed versions: {}".format(job, expected_version, job, tags[job])
)
def check_version(job, expected_version):
url = URL_TEMPLATE.format(job)
with urllib.request.urlopen(url) as f:
contents = f.read().decode('utf-8')
m = RE_VERSION.search(contents)
if not m:
raise RuntimeError(
"Unbelievable! I could not find the variable allDeployedVersions in "
"{}; did the organization of ossci-job-dsl change?\n\nFull contents:\n{}"
.format(url, contents)
)
valid_versions = [int(v) for v in m.group(1).split(',')]
if expected_version not in valid_versions:
raise RuntimeError(
"We configured {} to use Docker version {}; but this "
"version is not deployed in {}. Non-deployed versions will be "
"garbage collected two weeks after they are created. DO NOT LAND "
"THIS TO MASTER without also updating ossci-job-dsl with this version."
"\n\nDeployed versions: {}"
.format(job, expected_version, url, m.group(1))
)
def validate_docker_version():
check_version('pytorch', pytorch_build_definitions.DOCKER_IMAGE_VERSION)
check_version('caffe2', caffe2_build_definitions.DOCKER_IMAGE_VERSION)
tags = load_tags_for_projects(load_config())
check_version("pytorch", tags, pytorch_docker_constants.DOCKER_IMAGE_TAG)
check_version("caffe2", tags, caffe2_build_definitions.DOCKER_IMAGE_VERSION)
if __name__ == "__main__":
    validate_docker_version()
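For illustration, load_tags_for_projects simply inverts the ecr_gc workflow section into a project-to-tags map. A self-contained sketch with a made-up miniature config (the tag values here are hypothetical):

def load_tags_for_projects(workflow_config):
    # Same comprehension as in the script above, repeated so this example
    # runs on its own.
    return {
        v["ecr_gc_job"]["project"]: v["ecr_gc_job"]["tags_to_keep"]
        for v in workflow_config["workflows"]["ecr_gc"]["jobs"]
        if isinstance(v, dict) and "ecr_gc_job" in v
    }

workflow_config = {
    "workflows": {
        "ecr_gc": {
            "jobs": [
                "setup",  # plain string entries are skipped by the isinstance check
                {"ecr_gc_job": {"project": "pytorch", "tags_to_keep": "405,406"}},
                {"ecr_gc_job": {"project": "caffe2", "tags_to_keep": "376"}},
            ]
        }
    }
}
assert load_tags_for_projects(workflow_config) == {
    "pytorch": "405,406",
    "caffe2": "376",
}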

View File

@ -1,20 +0,0 @@
# There is currently no testing for libtorch TODO
# binary_linux_libtorch_2.7m_cpu_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 2.7m cpu"
# resource_class: gpu.medium
# <<: *binary_linux_test
#
# binary_linux_libtorch_2.7m_cu90_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 2.7m cu90"
# resource_class: gpu.medium
# <<: *binary_linux_test
#
# binary_linux_libtorch_2.7m_cu100_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 2.7m cu100"
# resource_class: gpu.medium
# <<: *binary_linux_test

View File

@ -52,3 +52,15 @@ binary_mac_params: &binary_mac_params
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
binary_windows_params: &binary_windows_params
parameters:
build_environment:
type: string
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
BUILD_FOR_SYSTEM: windows
JOB_EXECUTOR: <<parameters.executor>>
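These fragments lean on two different `<<` syntaxes: `<<: *binary_windows_params` is the plain YAML merge key, while `<< parameters.build_environment >>` is CircleCI's own parameter substitution, expanded before the YAML is consumed. A tiny PyYAML demonstration of just the merge-key half (the keys here are invented for the example):

import yaml

doc = """
base: &base
  resource_class: large
job:
  <<: *base
  name: example
"""
job = yaml.safe_load(doc)["job"]
# The anchor's keys are merged into the mapping that references it.
assert job == {"resource_class": "large", "name": "example"}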

View File

@ -25,4 +25,3 @@ caffe2_params: &caffe2_params
DOCKER_IMAGE: << parameters.docker_image >>
BUILD_ONLY: << parameters.build_only >>
resource_class: << parameters.resource_class >>

View File

@ -0,0 +1,14 @@
promote_common: &promote_common
docker:
- image: pytorch/release
parameters:
package_name:
description: "package name to promote"
type: string
default: ""
environment:
PACKAGE_NAME: << parameters.package_name >>
ANACONDA_API_TOKEN: ${CONDA_PYTORCHBOT_TOKEN}
AWS_ACCESS_KEY_ID: ${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}
AWS_SECRET_ACCESS_KEY: ${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}

View File

@ -0,0 +1,85 @@
pytorch_params: &pytorch_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
resource_class:
type: string
default: "large"
use_cuda_docker_runtime:
type: string
default: ""
build_only:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
BUILD_ONLY: << parameters.build_only >>
resource_class: << parameters.resource_class >>
pytorch_ios_params: &pytorch_ios_params
parameters:
build_environment:
type: string
default: ""
ios_arch:
type: string
default: ""
ios_platform:
type: string
default: ""
op_list:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
IOS_ARCH: << parameters.ios_arch >>
IOS_PLATFORM: << parameters.ios_platform >>
SELECTED_OP_LIST: << parameters.op_list >>
pytorch_windows_params: &pytorch_windows_params
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.11"
vc_year:
type: string
default: "2017"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: <<parameters.build_environment>>
SCCACHE_BUCKET: "ossci-compiler-cache"
CUDA_VERSION: <<parameters.cuda_version>>
PYTHON_VERSION: <<parameters.python_version>>
VC_VERSION: <<parameters.vc_version>>
VC_YEAR: <<parameters.vc_year>>
VC_PRODUCT: <<parameters.vc_product>>
USE_CUDA: <<parameters.use_cuda>>
TORCH_CUDA_ARCH_LIST: "7.5"
JOB_BASE_NAME: <<parameters.test_name>>
JOB_EXECUTOR: <<parameters.executor>>

View File

@ -1,18 +1,23 @@
commands:
# NB: This command must be run as the first command in a job. It
# attaches the workspace at ~/workspace; this workspace is generated
# by the setup job. Note that ~/workspace is not the default working
# directory (that's ~/project).
should_run_job:
description: "Test if the job should run or not"
# Must be run after attaching workspace from previous steps
load_shared_env:
description: "Loads .circleci/shared/env_file into ${BASH_ENV}"
parameters:
# For some weird reason we decide to reattach our workspace to ~/workspace, so
# in the vein of making it simple let's assume our shared env_file is here
root:
type: string
default: "~/workspace"
steps:
- attach_workspace:
name: Attaching workspace
at: ~/workspace
- run:
name: Should run job
no_output_timeout: "2m"
command: ~/workspace/.circleci/scripts/should_run_job.sh
name: "Load .circleci/shared/env_file into ${BASH_ENV}"
command: |
if [[ -f "<< parameters.root >>/.circleci/shared/env_file" ]]; then
cat << parameters.root >>/.circleci/shared/env_file >> ${BASH_ENV}
else
echo "We didn't have a shared env file, that's weird"
fi
# This system setup script is meant to run before the CI-related scripts, e.g.,
# installing Git client, checking out code, setting up CI env, and
@ -22,14 +27,14 @@ commands:
- run:
name: Set Up System Environment
no_output_timeout: "1h"
command: ~/workspace/.circleci/scripts/setup_linux_system_environment.sh
command: .circleci/scripts/setup_linux_system_environment.sh
setup_ci_environment:
steps:
- run:
name: Set Up CI Environment After attach_workspace
no_output_timeout: "1h"
command: ~/workspace/.circleci/scripts/setup_ci_environment.sh
command: .circleci/scripts/setup_ci_environment.sh
brew_update:
description: "Update Homebrew and install base formulae"
@ -88,3 +93,41 @@ commands:
- brew_update
- brew_install:
formulae: libtool
optional_merge_target_branch:
steps:
- run:
name: (Optional) Merge target branch
no_output_timeout: "10m"
command: |
if [ -n "$CIRCLE_PULL_REQUEST" ]; then
PR_NUM=$(basename $CIRCLE_PULL_REQUEST)
CIRCLE_PR_BASE_BRANCH=$(curl -s https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/pulls/$PR_NUM | jq -r '.base.ref')
if [[ "${BUILD_ENVIRONMENT}" == *"xla"* || "${BUILD_ENVIRONMENT}" == *"gcc5"* ]] ; then
set -x
git config --global user.email "circleci.ossci@gmail.com"
git config --global user.name "CircleCI"
git config remote.origin.url https://github.com/pytorch/pytorch.git
git config --add remote.origin.fetch +refs/heads/master:refs/remotes/origin/master
git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/master:refs/remotes/origin/master --depth=100 --quiet
# PRs generated from ghstack have the format CIRCLE_PR_BASE_BRANCH=gh/xxx/1234/base
if [[ "${CIRCLE_PR_BASE_BRANCH}" == "gh/"* ]]; then
CIRCLE_PR_BASE_BRANCH=master
fi
export GIT_MERGE_TARGET=`git log -n 1 --pretty=format:"%H" origin/$CIRCLE_PR_BASE_BRANCH`
echo "GIT_MERGE_TARGET: " ${GIT_MERGE_TARGET}
export GIT_COMMIT=${CIRCLE_SHA1}
echo "GIT_COMMIT: " ${GIT_COMMIT}
git checkout -f ${GIT_COMMIT}
git reset --hard ${GIT_COMMIT}
git merge --allow-unrelated-histories --no-edit --no-ff ${GIT_MERGE_TARGET}
echo "Merged $CIRCLE_PR_BASE_BRANCH branch before building in environment $BUILD_ENVIRONMENT"
set +x
else
echo "No need to merge with $CIRCLE_PR_BASE_BRANCH, skipping..."
fi
else
echo "This is not a pull request, skipping..."
fi
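The only non-obvious branch above is the ghstack normalization: ghstack-exported PRs report a base branch like gh/xxx/1234/base, which is not a usable merge target, so the job falls back to master. A small sketch of that normalization, mirroring the shell conditional:

def normalize_base_branch(circle_pr_base_branch):
    # ghstack base branches look like "gh/user/1234/base"; merge against
    # master instead, as the shell logic above does.
    if circle_pr_base_branch.startswith("gh/"):
        return "master"
    return circle_pr_base_branch

assert normalize_base_branch("gh/someuser/1234/base") == "master"
assert normalize_base_branch("release/1.6") == "release/1.6"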

View File

@ -1,21 +0,0 @@
docker_build_job:
parameters:
image_name:
type: string
default: ""
machine:
image: ubuntu-1604:201903-01
resource_class: large
environment:
IMAGE_NAME: << parameters.image_name >>
steps:
- checkout
- run:
name: build_docker_image_<< parameters.image_name >>
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
cd .circleci/docker && ./build_docker.sh

View File

@ -1,21 +1,34 @@
# WARNING: DO NOT EDIT THIS FILE DIRECTLY!!!
# See the README.md in this directory.
# IMPORTANT: To update Docker image version, please first update
# https://github.com/pytorch/ossci-job-dsl/blob/master/src/main/groovy/ossci/pytorch/DockerVersion.groovy and
# https://github.com/pytorch/ossci-job-dsl/blob/master/src/main/groovy/ossci/caffe2/DockerVersion.groovy,
# and then update DOCKER_IMAGE_VERSION at the top of the following files:
# * cimodel/data/pytorch_build_definitions.py
# * cimodel/data/caffe2_build_definitions.py
# And the inline copies of the variable in
# * verbatim-sources/job-specs-custom.yml
# (grep for DOCKER_IMAGE)
# IMPORTANT: To update Docker image version, please follow
# the instructions at
# https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI
version: 2.1
parameters:
run_binary_tests:
type: boolean
default: false
docker_config_defaults: &docker_config_defaults
user: jenkins
aws_auth:
# This IAM user only allows read-write access to ECR
aws_access_key_id: ${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4}
aws_secret_access_key: ${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4}
executors:
windows-with-nvidia-gpu:
machine:
resource_class: windows.gpu.nvidia.medium
image: windows-server-2019-nvidia:stable
shell: bash.exe
windows-cpu-with-nvidia-cuda:
machine:
# we will change to CPU host when it's ready
resource_class: windows.xlarge
image: windows-server-2019-vs2019:stable
shell: bash.exe

View File

@ -0,0 +1,14 @@
# There is currently no testing for libtorch TODO
# binary_linux_libtorch_3.6m_cpu_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cpu"
# resource_class: gpu.medium
# <<: *binary_linux_test
#
# binary_linux_libtorch_3.6m_cu90_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cu90"
# resource_class: gpu.medium
# <<: *binary_linux_test
#

View File

@ -2,7 +2,7 @@
<<: *binary_linux_build_params
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- run:
<<: *binary_checkout
- run:
@ -19,8 +19,8 @@
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
retry apt-get update
retry apt-get -y install expect moreutils
conda install -y -c eumetsat expect
conda install -y cmake
retry conda install -y -c eumetsat expect
retry conda install -y cmake
fi
- run:
name: Update compiler to devtoolset7
@ -41,10 +41,28 @@
no_output_timeout: "1h"
command: |
source "/pytorch/.circleci/scripts/binary_linux_build.sh"
- run:
name: Output binary sizes
no_output_timeout: "1m"
command: |
ls -lah /final_pkgs
- run:
name: save binary size
no_output_timeout: "5m"
command: |
source /env
cd /pytorch && export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
pip3 install requests && \
SCRIBE_GRAPHQL_ACCESS_TOKEN=${SCRIBE_GRAPHQL_ACCESS_TOKEN} \
python3 /pytorch/.circleci/scripts/upload_binary_size_to_scuba.py || exit 0
- persist_to_workspace:
root: /
paths: final_pkgs
- store_artifacts:
path: /final_pkgs
# This should really just be another step of the binary_linux_build job above.
# This isn't possible right now b/c the build job uses the docker executor
# (otherwise they'd be really really slow) but this one uses the machine
@ -56,7 +74,7 @@
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
# TODO: We shouldn't attach the workspace multiple times
- attach_workspace:
at: /home/circleci/project
@ -69,7 +87,7 @@
- run:
name: Prepare test code
no_output_timeout: "1h"
command: ~/workspace/.circleci/scripts/binary_linux_test.sh
command: .circleci/scripts/binary_linux_test.sh
- run:
<<: *binary_run_in_docker
@ -79,7 +97,7 @@
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- setup_linux_system_environment
- setup_ci_environment
- attach_workspace:
@ -91,7 +109,7 @@
- run:
name: Upload
no_output_timeout: "1h"
command: ~/workspace/.circleci/scripts/binary_linux_upload.sh
command: .circleci/scripts/binary_linux_upload.sh
# Nightly build smoke test defaults
# These are the second-round smoke tests. These make sure that the binaries are
@ -103,10 +121,7 @@
machine:
image: ubuntu-1604:201903-01
steps:
- attach_workspace:
at: ~/workspace
- attach_workspace:
at: /home/circleci/project
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -130,12 +145,9 @@
smoke_mac_test:
<<: *binary_linux_test_upload_params
macos:
xcode: "9.0"
xcode: "9.4.1"
steps:
- attach_workspace:
at: ~/workspace
- attach_workspace: # TODO - we can `cp` from ~/workspace
at: /Users/distiller/project
- checkout
- run:
<<: *binary_checkout
- run:
@ -158,10 +170,10 @@
binary_mac_build:
<<: *binary_mac_params
macos:
xcode: "9.0"
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- run:
<<: *binary_checkout
- run:
@ -199,10 +211,10 @@
binary_mac_upload: &binary_mac_upload
<<: *binary_mac_params
macos:
xcode: "9.0"
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- run:
<<: *binary_checkout
- run:
@ -227,7 +239,6 @@
steps:
- attach_workspace:
at: ~/workspace
- should_run_job
- checkout
- run_brew_for_ios_build
- run:
@ -247,15 +258,14 @@
- persist_to_workspace:
root: /Users/distiller/workspace/
paths: ios
binary_ios_upload:
binary_ios_upload:
<<: *pytorch_ios_params
macos:
xcode: "11.2.1"
steps:
- attach_workspace:
at: ~/workspace
- should_run_job
- checkout
- run_brew_for_ios_build
- run:
@ -265,3 +275,108 @@
script="/Users/distiller/project/.circleci/scripts/binary_ios_upload.sh"
cat "$script"
source "$script"
binary_windows_build:
<<: *binary_windows_params
parameters:
build_environment:
type: string
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Build
no_output_timeout: "1h"
command: |
set -eux -o pipefail
script="/c/w/p/.circleci/scripts/binary_windows_build.sh"
cat "$script"
source "$script"
- persist_to_workspace:
root: "C:/w"
paths: final_pkgs
binary_windows_test:
<<: *binary_windows_params
parameters:
build_environment:
type: string
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
- checkout
- attach_workspace:
at: c:/users/circleci/project
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Test
no_output_timeout: "1h"
command: |
set -eux -o pipefail
script="/c/w/p/.circleci/scripts/binary_windows_test.sh"
cat "$script"
source "$script"
binary_windows_upload:
<<: *binary_windows_params
docker:
- image: continuumio/miniconda
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- attach_workspace:
at: /root/workspace
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Upload
no_output_timeout: "10m"
command: |
set -eux -o pipefail
script="/pytorch/.circleci/scripts/binary_windows_upload.sh"
cat "$script"
source "$script"
smoke_windows_test:
<<: *binary_windows_params
parameters:
build_environment:
type: string
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
- checkout
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Test
no_output_timeout: "1h"
command: |
set -eux -o pipefail
export TEST_NIGHTLY_PACKAGE=1
script="/c/w/p/.circleci/scripts/binary_windows_test.sh"
cat "$script"
source "$script"

View File

@ -10,8 +10,7 @@
machine:
image: ubuntu-1604:201903-01
steps:
- attach_workspace:
at: ~/workspace
- checkout
- setup_linux_system_environment
- run:
<<: *binary_checkout
@ -28,6 +27,15 @@
# make sure it has the same upload folder as the job it's attached to. This
# function is idempotent, so it won't hurt anything; it's just a little
# unnecessary"
- run:
name: define PIP_UPLOAD_FOLDER
command: |
our_upload_folder=nightly/
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_folder=test/
fi
echo "export PIP_UPLOAD_FOLDER=${our_upload_folder}" >> ${BASH_ENV}
- run:
name: Update s3 htmls
no_output_timeout: "1h"
@ -42,55 +50,3 @@
}
retry pip install awscli==1.6
"/home/circleci/project/builder/cron/update_s3_htmls.sh"
# Update s3 htmls for the nightlies
update_s3_htmls_for_nightlies:
environment:
PIP_UPLOAD_FOLDER: "nightly/"
<<: *update_s3_htmls
# Update s3 htmls for the nightlies for devtoolset7
update_s3_htmls_for_nightlies_devtoolset7:
environment:
PIP_UPLOAD_FOLDER: "nightly/devtoolset7/"
<<: *update_s3_htmls
# upload_binary_logs job
# The builder hud at pytorch.org/builder shows the sizes of all the binaries
# over time. It gets this info from html files stored in S3, which this job
# populates every day.
upload_binary_sizes: &upload_binary_sizes
machine:
image: ubuntu-1604:201903-01
steps:
- attach_workspace:
at: ~/workspace
- setup_linux_system_environment
- run:
<<: *binary_checkout
- run:
<<: *binary_install_miniconda
- run:
name: Upload binary sizes
no_output_timeout: "1h"
command: |
set +x
echo "declare -x \"AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}\"" > /home/circleci/project/env
echo "declare -x \"AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}\"" >> /home/circleci/project/env
export DATE="$(date -u +%Y_%m_%d)"
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
source /home/circleci/project/env
set -eux -o pipefail
# This is hardcoded to match binary_install_miniconda.sh
export PATH="/home/circleci/project/miniconda/bin:$PATH"
# Not any awscli will work. Most won't. This one will work
retry conda create -qyn aws36 python=3.6
source activate aws36
pip install awscli==1.16.46
"/home/circleci/project/builder/cron/upload_binary_sizes.sh"

View File

@ -4,9 +4,8 @@
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- setup_linux_system_environment
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Build
@ -64,7 +63,7 @@
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -124,10 +123,9 @@
caffe2_macos_build:
<<: *caffe2_params
macos:
xcode: "9.0"
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- run_brew_for_macos_build
- run:
@ -151,7 +149,7 @@
# Install Anaconda if we need to
if [ -n "${CAFFE2_USE_ANACONDA}" ]; then
rm -rf ${TMPDIR}/anaconda
curl -o ${TMPDIR}/conda.sh https://repo.continuum.io/miniconda/Miniconda${ANACONDA_VERSION}-latest-MacOSX-x86_64.sh
curl --retry 3 -o ${TMPDIR}/conda.sh https://repo.anaconda.com/miniconda/Miniconda${ANACONDA_VERSION}-latest-MacOSX-x86_64.sh
chmod +x ${TMPDIR}/conda.sh
/bin/bash ${TMPDIR}/conda.sh -b -p ${TMPDIR}/anaconda
rm -f ${TMPDIR}/conda.sh
@ -162,7 +160,7 @@
pip -q install numpy
# Install sccache
sudo curl https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2

View File

@ -0,0 +1,135 @@
docker_build_job:
parameters:
image_name:
type: string
default: ""
machine:
image: ubuntu-1604:201903-01
resource_class: large
environment:
IMAGE_NAME: << parameters.image_name >>
# Enable 'docker manifest'
DOCKER_CLI_EXPERIMENTAL: "enabled"
DOCKER_BUILDKIT: 1
steps:
- checkout
- run:
name: Calculate docker tag
command: |
set -x
mkdir .circleci/shared
# git keeps a hash of all sub trees
echo "export DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)" >> .circleci/shared/env_file
# Saves our calculated docker tag to our workspace for later use
- persist_to_workspace:
root: .
paths:
- .circleci/shared/
- load_shared_env:
root: .
- run:
name: Check if image should be built
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
eval $(aws ecr get-login --no-include-email --region us-east-1)
set -x
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
# Check if image already exists, if it does then skip building it
if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
circleci-agent step halt
# circleci-agent step halt doesn't actually halt the step so we need to
# explicitly exit the step here ourselves before it causes too much trouble
exit 0
fi
# If no image exists but the hash is the same as the previous hash then we should error out here
if [[ ${PREVIOUS_DOCKER_TAG} = ${DOCKER_TAG} ]]; then
echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
echo " contact the PyTorch team to restore the original images"
exit 1
fi
- run:
name: build_docker_image_<< parameters.image_name >>
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
cd .circleci/docker && ./build_docker.sh
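The tag scheme above keys Docker images off the git tree hash of .circleci/docker, so an image only needs rebuilding when that directory's contents change; the check step compares the current hash against the merge-base's hash to detect a lost image. A sketch of the same two lookups (the base revision is a pipeline parameter in the real job; origin/master below is a stand-in):

import subprocess

def tree_hash(rev, path=".circleci/docker"):
    # git stores a hash per subtree, so "<rev>:<path>" names the exact
    # contents of that directory at that revision.
    return subprocess.check_output(
        ["git", "rev-parse", "%s:%s" % (rev, path)], text=True).strip()

def merge_base(base_revision="origin/master"):
    return subprocess.check_output(
        ["git", "merge-base", "HEAD", base_revision], text=True).strip()

docker_tag = tree_hash("HEAD")
previous_tag = tree_hash(merge_base())
# Mirroring the step above: identical hashes with no published image means
# the original image was lost and the job should fail rather than rebuild.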
docker_for_ecr_gc_build_job:
machine:
image: ubuntu-1604:201903-01
steps:
- checkout
- run:
name: build_docker_image_for_ecr_gc
no_output_timeout: "1h"
command: |
cd .circleci/ecr_gc_docker
docker build . -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
eval $(aws ecr get-login --no-include-email --region us-east-1)
set -x
docker push 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
ecr_gc_job:
parameters:
project:
type: string
default: "pytorch"
tags_to_keep: # comma separate values
type: string
environment:
PROJECT: << parameters.project >>
# TODO: Remove legacy image tags once we feel comfortable with new docker image tags
IMAGE_TAG: << parameters.tags_to_keep >>
docker:
- image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
aws_auth:
aws_access_key_id: ${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
aws_secret_access_key: ${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
steps:
- checkout
- run:
# NOTE: see 'docker_build_job' for how these tags actually get built
name: dynamically generate tags to keep
no_output_timeout: "1h"
command: |
GENERATED_IMAGE_TAG=$(\
git log --oneline --pretty='%H' .circleci/docker \
| xargs -I '{}' git rev-parse '{}:.circleci/docker' \
| paste -sd "," -)
echo "export GENERATED_IMAGE_TAG='${GENERATED_IMAGE_TAG}'" >> ${BASH_ENV}
- run:
name: garbage collecting for ecr images
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
/usr/bin/gc.py --filter-prefix ${PROJECT} --ignore-tags "${IMAGE_TAG},${GENERATED_IMAGE_TAG}"
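GENERATED_IMAGE_TAG walks every commit that ever touched .circleci/docker and resolves the subtree hash each one produced, so garbage collection keeps every tag that was legitimately built. A Python analogue of that shell pipeline:

import subprocess

def generated_image_tags(path=".circleci/docker"):
    # One tag per commit that touched `path`: resolve the subtree hash at
    # each of those commits and join them with commas, as the shell does.
    commits = subprocess.check_output(
        ["git", "log", "--pretty=%H", path], text=True).split()
    hashes = [
        subprocess.check_output(
            ["git", "rev-parse", "%s:%s" % (commit, path)], text=True).strip()
        for commit in commits
    ]
    return ",".join(hashes)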
docker_hub_index_job:
docker:
- image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
aws_auth:
aws_access_key_id: ${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
aws_secret_access_key: ${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
steps:
- run:
name: garbage collecting for ecr images
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
export DOCKER_HUB_USERNAME=${CIRCLECI_DOCKER_HUB_USERNAME}
export DOCKER_HUB_PASSWORD=${CIRCLECI_DOCKER_HUB_PASSWORD}
set -x
/usr/bin/docker_hub.py

View File

@ -2,13 +2,12 @@
environment:
BUILD_ENVIRONMENT: pytorch-python-doc-push
# TODO: stop hardcoding this
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:405"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
resource_class: large
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -39,21 +38,26 @@
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_artifacts
docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io/docs/master ~/workspace/build_artifacts
# Save the docs build so we can debug any problems
export DEBUG_COMMIT_DOCKER_IMAGE=${COMMIT_DOCKER_IMAGE}-debug
docker commit "$id" ${DEBUG_COMMIT_DOCKER_IMAGE}
time docker push ${DEBUG_COMMIT_DOCKER_IMAGE}
- store_artifacts:
path: ~/workspace/build_artifacts/master
destination: docs
pytorch_cpp_doc_push:
environment:
BUILD_ENVIRONMENT: pytorch-cpp-doc-push
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:405"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
resource_class: large
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -93,10 +97,8 @@
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-build
macos:
xcode: "9.0"
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- run_brew_for_macos_build
- run:
@ -107,7 +109,7 @@
export IN_CIRCLECI=1
# Install sccache
sudo curl https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
@ -120,24 +122,20 @@
chmod a+x .jenkins/pytorch/macos-build.sh
unbuffer .jenkins/pytorch/macos-build.sh 2>&1 | ts
# copy with -a to preserve relative structure (e.g., symlinks), and be recursive
cp -a ~/project ~/workspace
- persist_to_workspace:
root: ~/workspace
root: /Users/distiller/workspace/
paths:
- miniconda3
- project
pytorch_macos_10_13_py3_test:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "9.0"
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
# This workspace also carries binaries from the build job
- should_run_job
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
@ -146,74 +144,22 @@
set -e
export IN_CIRCLECI=1
# copy with -a to preserve relative structure (e.g., symlinks), and be recursive
cp -a ~/workspace/project/. ~/project
chmod a+x .jenkins/pytorch/macos-test.sh
unbuffer .jenkins/pytorch/macos-test.sh 2>&1 | ts
- store_test_results:
path: test/test-reports
pytorch_macos_10_13_cuda9_2_cudnn7_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-cuda9.2-cudnn7-py3-build
macos:
xcode: "9.0"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export IN_CIRCLECI=1
# Install CUDA 9.2
sudo rm -rf ~/cuda_9.2.64_mac_installer.app || true
curl https://s3.amazonaws.com/ossci-macos/cuda_9.2.64_mac_installer.zip -o ~/cuda_9.2.64_mac_installer.zip
unzip ~/cuda_9.2.64_mac_installer.zip -d ~/
sudo ~/cuda_9.2.64_mac_installer.app/Contents/MacOS/CUDAMacOSXInstaller --accept-eula --no-window
sudo cp /usr/local/cuda/lib/libcuda.dylib /Developer/NVIDIA/CUDA-9.2/lib/libcuda.dylib
sudo rm -rf /usr/local/cuda || true
# Install cuDNN 7.1 for CUDA 9.2
curl https://s3.amazonaws.com/ossci-macos/cudnn-9.2-osx-x64-v7.1.tgz -o ~/cudnn-9.2-osx-x64-v7.1.tgz
rm -rf ~/cudnn-9.2-osx-x64-v7.1 && mkdir ~/cudnn-9.2-osx-x64-v7.1
tar -xzvf ~/cudnn-9.2-osx-x64-v7.1.tgz -C ~/cudnn-9.2-osx-x64-v7.1
sudo cp ~/cudnn-9.2-osx-x64-v7.1/cuda/include/cudnn.h /Developer/NVIDIA/CUDA-9.2/include/
sudo cp ~/cudnn-9.2-osx-x64-v7.1/cuda/lib/libcudnn* /Developer/NVIDIA/CUDA-9.2/lib/
sudo chmod a+r /Developer/NVIDIA/CUDA-9.2/include/cudnn.h /Developer/NVIDIA/CUDA-9.2/lib/libcudnn*
# Install sccache
sudo curl https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
git submodule sync && git submodule update -q --init --recursive
chmod a+x .jenkins/pytorch/macos-build.sh
unbuffer .jenkins/pytorch/macos-build.sh 2>&1 | ts
pytorch_android_gradle_build:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
steps:
- should_run_job
- setup_linux_system_environment
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: pytorch android gradle build
@ -247,7 +193,7 @@
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir ~/workspace/build_android_install_arm_v7a
mkdir -p ~/workspace/build_android_install_arm_v7a
docker cp $id_arm_v7a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v7a
# x86_64
@ -257,7 +203,7 @@
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir ~/workspace/build_android_install_x86_64
mkdir -p ~/workspace/build_android_install_x86_64
docker cp $id_x86_64:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_x86_64
# arm-v8a
@ -267,7 +213,7 @@
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir ~/workspace/build_android_install_arm_v8a
mkdir -p ~/workspace/build_android_install_arm_v8a
docker cp $id_arm_v8a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v8a
docker cp ~/workspace/build_android_install_arm_v7a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v7a
@ -284,6 +230,26 @@
output_image=$docker_image_libtorch_android_x86_32-gradle
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
- run:
name: save binary size
no_output_timeout: "5m"
command: |
docker_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32-gradle
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image})
echo "docker-id: $id"
cat \<< EOL | docker exec -u jenkins -i "$id" bash
# ============================== Begin Docker ==============================
cd workspace
source ./env
export ANDROID_BUILD_TYPE="prebuild"
export COMMIT_TIME=\$(git log --max-count=1 --format=%ct || echo 0)
export CIRCLE_BUILD_NUM="${CIRCLE_BUILD_NUM}"
export CIRCLE_SHA1="${CIRCLE_SHA1}"
export CIRCLE_BRANCH="${CIRCLE_BRANCH}"
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
python .circleci/scripts/upload_binary_size_to_scuba.py android
# ============================== End Docker ==============================
EOL
- store_artifacts:
path: ~/workspace/build_android_artifacts/artifacts.tgz
destination: artifacts.tgz
@ -291,13 +257,13 @@
pytorch_android_publish_snapshot:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
steps:
- should_run_job
- checkout
- setup_linux_system_environment
- checkout
- setup_ci_environment
@ -327,13 +293,13 @@
pytorch_android_gradle_build-x86_32:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
steps:
- should_run_job
- checkout
- run:
name: filter out not PR runs
no_output_timeout: "5m"
@ -366,6 +332,26 @@
output_image=${docker_image_libtorch_android_x86_32}-gradle
docker commit "$id" ${output_image}
time docker push ${output_image}
- run:
name: save binary size
no_output_timeout: "5m"
command: |
docker_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32-gradle
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image})
echo "docker-id: $id"
cat \<< EOL | docker exec -u jenkins -i "$id" bash
# ============================== Begin Docker ==============================
cd workspace
source ./env
export ANDROID_BUILD_TYPE="prebuild-single"
export COMMIT_TIME=\$(git log --max-count=1 --format=%ct || echo 0)
export CIRCLE_BUILD_NUM="${CIRCLE_BUILD_NUM}"
export CIRCLE_SHA1="${CIRCLE_SHA1}"
export CIRCLE_BRANCH="${CIRCLE_BRANCH}"
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
python .circleci/scripts/upload_binary_size_to_scuba.py android
# ============================== End Docker ==============================
EOL
- store_artifacts:
path: ~/workspace/build_android_x86_32_artifacts/artifacts.tgz
destination: artifacts.tgz
@ -375,10 +361,8 @@
macos:
xcode: "11.2.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- checkout
- run_brew_for_ios_build
- run_brew_for_ios_build
- run:
name: Run Fastlane
no_output_timeout: "1h"
@ -410,30 +394,44 @@
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
export TCLLIBPATH="/usr/local/lib"
# Install conda
curl -o ~/Downloads/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x ~/Downloads/conda.sh
/bin/bash ~/Downloads/conda.sh -b -p ~/anaconda
curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x ~/conda.sh
/bin/bash ~/conda.sh -b -p ~/anaconda
export PATH="~/anaconda/bin:${PATH}"
source ~/anaconda/bin/activate
# Install dependencies
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing requests --yes
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing requests --yes
# sync submodules
cd ${PROJ_ROOT}
git submodule sync
git submodule update --init --recursive
# export
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# run build script
chmod a+x ${PROJ_ROOT}/scripts/build_ios.sh
echo "IOS_ARCH: ${IOS_ARCH}"
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
export BUILD_PYTORCH_MOBILE=1
#check the custom build flag
echo "SELECTED_OP_LIST: ${SELECTED_OP_LIST}"
if [ -n "${SELECTED_OP_LIST}" ]; then
export SELECTED_OP_LIST="${PROJ_ROOT}/ios/TestApp/custom_build/${SELECTED_OP_LIST}"
fi
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
- run:
name: Run Build Tests
name: Run Build Test
no_output_timeout: "30m"
command: |
set -e
@ -445,7 +443,11 @@
exit 1
fi
echo ${IOS_DEV_TEAM_ID}
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID}
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID}
else
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM}
fi
if ! [ "$?" -eq "0" ]; then
echo 'xcodebuild failed!'
exit 1
@ -455,15 +457,14 @@
no_output_timeout: "2h"
command: |
set -e
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
echo "not SIMULATOR build, skip it."
exit 0
fi
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
source ~/anaconda/bin/activate
#install the latest version of PyTorch and TorchVision
pip install torch torchvision
pip install torch torchvision --progress-bar off
#run unit test
cd ${PROJ_ROOT}/ios/TestApp/benchmark
python trace_model.py
@ -471,4 +472,106 @@
cd ${PROJ_ROOT}/ios/TestApp
instruments -s -devices
fastlane scan
pytorch_linux_bazel_build:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
steps:
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Bazel Build
no_output_timeout: "1h"
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
git submodule sync && git submodule update -q --init --recursive
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Augment our output image name with bazel to avoid collisions
output_image=${DOCKER_IMAGE}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
pytorch_linux_bazel_test:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
steps:
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
output_image=${DOCKER_IMAGE}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
retrieve_test_reports() {
echo "retrieving test reports"
docker cp -L $id:/var/lib/jenkins/workspace/bazel-testlogs ./ || echo 'No test reports found!'
}
trap "retrieve_test_reports" ERR
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
retrieve_test_reports
docker stats --all --no-stream
- store_test_results:
path: bazel-testlogs
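The `trap "retrieve_test_reports" ERR` idiom above salvages the bazel-testlogs directory even when the test command fails, and the explicit call afterwards covers the success path. In Python the equivalent shape is try/finally (the command names here are placeholders):

import subprocess

def run_tests_collecting_reports(test_cmd, collect_cmd):
    # Equivalent of trap-on-ERR plus the trailing explicit call: report
    # collection runs whether or not the tests succeed.
    try:
        subprocess.run(test_cmd, check=True)
    finally:
        subprocess.run(collect_cmd, check=False)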
pytorch_doc_test:
environment:
BUILD_ENVIRONMENT: pytorch-doc-test
# TODO: stop hardcoding this
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
resource_class: medium
machine:
image: ubuntu-1604:201903-01
steps:
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Doc test
no_output_timeout: "30m"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.jenkins/pytorch/docs-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts

View File

@ -0,0 +1,18 @@
promote_s3:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/wheel_to_s3.sh
promote_conda:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/conda_to_conda.sh

View File

@ -27,4 +27,3 @@
- persist_to_workspace:
root: .
paths: .circleci/scripts

View File

@ -0,0 +1,304 @@
jobs:
pytorch_linux_build:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- checkout
- optional_merge_target_branch
- setup_ci_environment
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
git submodule sync && git submodule update -q --init --recursive
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo '"$PARALLEL_FLAGS"' && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Note [Special build images]
# The xla build uses the same docker image as
# pytorch-linux-trusty-py3.6-gcc5.4-build. In the push step, we have to
# distinguish between them so the test can pick up the correct image.
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-libtorch
elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-parallelnative
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_64"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_64
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v7a"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v7a
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v8a"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v8a
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_32"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_32
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-vulkan-x86_32"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-vulkan-x86_32
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
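The elif chain in the push step maps BUILD_ENVIRONMENT substrings onto an image-name suffix so that builds sharing a base Docker image (see Note [Special build images]) stay distinguishable. The same table, as a sketch, with first match winning exactly as in the shell:

IMAGE_SUFFIXES = [
    ("xla", "-xla"),
    ("libtorch", "-libtorch"),
    ("paralleltbb", "-paralleltbb"),
    ("parallelnative", "-parallelnative"),
    ("android-ndk-r19c-x86_64", "-android-x86_64"),
    ("android-ndk-r19c-arm-v7a", "-android-arm-v7a"),
    ("android-ndk-r19c-arm-v8a", "-android-arm-v8a"),
    ("android-ndk-r19c-x86_32", "-android-x86_32"),
    ("android-ndk-r19c-vulkan-x86_32", "-android-vulkan-x86_32"),
]

def commit_docker_image(docker_image, sha1, build_environment):
    # "$DOCKER_IMAGE-$CIRCLE_SHA1" plus the first matching suffix, if any.
    output_image = "%s-%s" % (docker_image, sha1)
    for needle, suffix in IMAGE_SUFFIXES:
        if needle in build_environment:
            return output_image + suffix
    return output_image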
pytorch_linux_test:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Download Docker image
no_output_timeout: "90m"
command: |
set -e
# See Note [Special build images]
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-libtorch
elif [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-parallelnative
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
# Pass environment variables to the next step
# See https://circleci.com/docs/2.0/env-vars/#using-parameters-and-bash-environment
echo "export PARALLEL_FLAGS=\"${PARALLEL_FLAGS}\"" >> $BASH_ENV
echo "export id=$id" >> $BASH_ENV
- run:
name: Check for no AVX instruction by default
no_output_timeout: "20m"
command: |
set -e
is_vanilla_build() {
if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-bionic-py3.6-clang9-test" ]; then
return 0
fi
if [ "${BUILD_ENVIRONMENT}" == "pytorch-linux-xenial-py3.6-gcc5.4-test" ]; then
return 0
fi
return 1
}
if is_vanilla_build; then
echo "apt-get update && apt-get install -y qemu-user" | docker exec -u root -i "$id" bash
echo "cd workspace/build; qemu-x86_64 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU" | docker exec -u jenkins -i "$id" bash
else
echo "Skipping for ${BUILD_ENVIRONMENT}"
fi
- run:
name: Run tests
no_output_timeout: "90m"
command: |
set -e
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
- run:
name: Report results
no_output_timeout: "5m"
command: |
set -e
docker stats --all --no-stream
echo "cd workspace; python test/print_test_stats.py test" | docker exec -u jenkins -i "$id" bash
echo "Retrieving test reports"
docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!'
when: always
- store_test_results:
path: test-reports
pytorch_windows_build:
<<: *pytorch_windows_params
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.11"
vc_year:
type: string
default: "2017"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
executor: <<parameters.executor>>
steps:
- checkout
- run:
name: Install VS2017
command: |
if [[ "${VC_YEAR}" == "2017" ]]; then
powershell .circleci/scripts/vs_install.ps1
fi
- run:
name: Install Cuda
no_output_timeout: 30m
command: |
if [[ "${USE_CUDA}" == "1" ]]; then
.circleci/scripts/windows_cuda_install.sh
fi
- run:
name: Install Cudnn
command : |
if [[ "${USE_CUDA}" == "1" ]]; then
cd c:/
curl --retry 3 -O https://ossci-windows.s3.amazonaws.com/cudnn-10.1-windows10-x64-v7.6.4.38.zip
7z x cudnn-10.1-windows10-x64-v7.6.4.38.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/"
fi
- run:
name: Build
no_output_timeout: "90m"
command: |
set -e
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_WIN_BUILD_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}
set -x
.jenkins/pytorch/win-build.sh
- persist_to_workspace:
root: "C:/w"
paths: build-results
- store_artifacts:
path: C:/w/build-results
pytorch_windows_test:
<<: *pytorch_windows_params
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.11"
vc_year:
type: string
default: "2017"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
executor: <<parameters.executor>>
steps:
- checkout
- attach_workspace:
at: c:/users/circleci/workspace
- run:
name: Install VS2017
command: |
if [[ "${VC_YEAR}" == "2017" ]]; then
powershell .circleci/scripts/vs_install.ps1
fi
- run:
name: Install Cuda
no_output_timeout: 30m
command: |
if [[ "${CUDA_VERSION}" != "cpu" && "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
.circleci/scripts/windows_cuda_install.sh
fi
- run:
name: Install Cudnn
command : |
if [[ "${CUDA_VERSION}" != "cpu" && "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
cd c:/
curl --retry 3 -O https://ossci-windows.s3.amazonaws.com/cudnn-10.1-windows10-x64-v7.6.4.38.zip
7z x cudnn-10.1-windows10-x64-v7.6.4.38.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/"
fi
- run:
name: Test
no_output_timeout: "30m"
command: |
set -e
export IN_CIRCLECI=1
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_WIN_BUILD_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}
set -x
.jenkins/pytorch/win-test.sh
- store_test_results:
path: test/test-reports

View File

@ -26,18 +26,18 @@
# (smoke tests and upload jobs do not need the pytorch repo).
binary_checkout: &binary_checkout
name: Checkout pytorch/builder repo
command: ~/workspace/.circleci/scripts/binary_checkout.sh
command: .circleci/scripts/binary_checkout.sh
# Parses circleci arguments in a consistent way, essentially routing to the
# correct pythonXgccXcudaXos build we want
binary_populate_env: &binary_populate_env
name: Set up binary env variables
command: ~/workspace/.circleci/scripts/binary_populate_env.sh
command: .circleci/scripts/binary_populate_env.sh
binary_install_miniconda: &binary_install_miniconda
name: Install miniconda
no_output_timeout: "1h"
command: ~/workspace/.circleci/scripts/binary_install_miniconda.sh
command: .circleci/scripts/binary_install_miniconda.sh
# This section is used in the binary_test and smoke_test jobs. It expects
# 'binary_populate_env' to have populated /home/circleci/project/env and it
@ -47,4 +47,4 @@ binary_run_in_docker: &binary_run_in_docker
name: Run in docker
# This step only runs on circleci linux machine executors that themselves
# need to start docker images
command: ~/workspace/.circleci/scripts/binary_run_in_docker.sh
command: .circleci/scripts/binary_run_in_docker.sh

View File

@ -1,39 +0,0 @@
pytorch_params: &pytorch_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
resource_class:
type: string
default: "large"
use_cuda_docker_runtime:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
resource_class: << parameters.resource_class >>
pytorch_ios_params: &pytorch_ios_params
parameters:
build_environment:
type: string
default: ""
ios_arch:
type: string
default: ""
ios_platform:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
IOS_ARCH: << parameters.ios_arch >>
IOS_PLATFORM: << parameters.ios_platform >>

View File

@ -1,141 +0,0 @@
jobs:
pytorch_linux_build:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- setup_linux_system_environment
- checkout
- setup_ci_environment
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
# NB: Temporarily disable the rebase logic in v1.4.0, don't merge this change into master
# # TODO We may want to move the rebase logic to a separate step after checkout
# # Rebase to master only if in xenial_py3_6_gcc5_4 case
# if [[ "${CIRCLE_BRANCH}" != "master" && "${BUILD_ENVIRONMENT}" == *"gcc5"* ]]; then
# echo "Merge master branch into $CIRCLE_BRANCH before build in environment $BUILD_ENVIRONMENT"
# set -x
# git config --global user.email "circleci.ossci@gmail.com"
# git config --global user.name "CircleCI"
# git config remote.origin.url https://github.com/pytorch/pytorch.git
# git config --add remote.origin.fetch +refs/heads/master:refs/remotes/origin/master
# git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/master:refs/remotes/origin/master --depth=100 --quiet
# export GIT_MERGE_TARGET=`git log -n 1 --pretty=format:"%H" origin/master`
# echo "GIT_MERGE_TARGET: " ${GIT_MERGE_TARGET}
# export GIT_COMMIT=${CIRCLE_SHA1}
# echo "GIT_COMMIT: " ${GIT_COMMIT}
# git checkout -f ${GIT_COMMIT}
# git reset --hard ${GIT_COMMIT}
# git merge --allow-unrelated-histories --no-edit --no-ff ${GIT_MERGE_TARGET}
# set +x
# else
# echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
# fi
git submodule sync && git submodule update -q --init --recursive
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
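# Pick ATen's threading backend for this flavor: Intel TBB or the native thread pool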
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo '"$PARALLEL_FLAGS"' && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Note [Special build images]
# The xla build uses the same docker image as
# pytorch-linux-trusty-py3.6-gcc5.4-build. In the push step, we have to
# distinguish between them so the test can pick up the correct image.
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-libtorch
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_64"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_64
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v7a"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v7a
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-arm-v8a"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-arm-v8a
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-x86_32"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_32
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
pytorch_linux_test:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- should_run_job
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
# See Note [Special build images]
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-libtorch
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
retrieve_test_reports() {
echo "retrieving test reports"
docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!'
}
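# The ERR trap copies reports out of the container even when a test command fails;
# the explicit call after the run covers the success path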
trap "retrieve_test_reports" ERR
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
retrieve_test_reports
- store_test_results:
path: test-reports
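The COMMAND quoting above is dense. A stripped-down, runnable sketch of the same pattern (build a small script on the host, then stream it into the detached container over stdin); the image and script contents are hypothetical stand-ins, not the values this config uses:

# Sketch of the exec pattern used above, with a hypothetical image.
id=$(docker run -t -d -w /tmp ubuntu:18.04)
{ echo 'export BUILD_ENVIRONMENT=demo'; echo 'echo "building in $PWD for $BUILD_ENVIRONMENT"'; } | docker exec -i "$id" bash 2>&1
docker rm -f "$id"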


@@ -1,4 +0,0 @@
##############################################################################
# Daily binary build trigger
##############################################################################


@@ -1,101 +0,0 @@
# Binary builds (subset, to smoke test that they'll work)
#
# NB: If you modify this file, you need to also modify
# the binary_and_smoke_tests_on_pr variable in
# pytorch-ci-hud to adjust the list of whitelisted builds
# at https://github.com/ezyang/pytorch-ci-hud/blob/master/src/BuildHistoryDisplay.js
- binary_linux_build:
name: binary_linux_manywheel_2_7mu_cpu_devtoolset7_build
build_environment: "manywheel 2.7mu cpu devtoolset7"
requires:
- setup
docker_image: "pytorch/manylinux-cuda100"
- binary_linux_build:
name: binary_linux_manywheel_3_7m_cu100_devtoolset7_build
build_environment: "manywheel 3.7m cu100 devtoolset7"
requires:
- setup
docker_image: "pytorch/manylinux-cuda100"
- binary_linux_build:
name: binary_linux_conda_2_7_cpu_devtoolset7_build
build_environment: "conda 2.7 cpu devtoolset7"
requires:
- setup
docker_image: "pytorch/conda-cuda"
# This binary build is currently broken, see https://github.com/pytorch/pytorch/issues/16710
# - binary_linux_conda_3_6_cu90_devtoolset7_build
- binary_linux_build:
name: binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_build
build_environment: "libtorch 2.7m cpu devtoolset7"
requires:
- setup
libtorch_variant: "shared-with-deps"
docker_image: "pytorch/manylinux-cuda100"
- binary_linux_build:
name: binary_linux_libtorch_2_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build
build_environment: "libtorch 2.7m cpu gcc5.4_cxx11-abi"
requires:
- setup
libtorch_variant: "shared-with-deps"
docker_image: "pytorch/pytorch-binary-docker-image-ubuntu16.04:latest"
# TODO we should test a libtorch cuda build, but they take too long
# - binary_linux_libtorch_2_7m_cu90_devtoolset7_static-without-deps_build
- binary_mac_build:
name: binary_macos_wheel_3_6_cpu_build
build_environment: "wheel 3.6 cpu"
requires:
- setup
- binary_mac_build:
name: binary_macos_conda_2_7_cpu_build
build_environment: "conda 2.7 cpu"
requires:
- setup
- binary_mac_build:
name: binary_macos_libtorch_2_7_cpu_build
build_environment: "libtorch 2.7 cpu"
requires:
- setup
- binary_linux_test:
name: binary_linux_manywheel_2_7mu_cpu_devtoolset7_test
build_environment: "manywheel 2.7mu cpu devtoolset7"
requires:
- setup
- binary_linux_manywheel_2_7mu_cpu_devtoolset7_build
docker_image: "pytorch/manylinux-cuda100"
- binary_linux_test:
name: binary_linux_manywheel_3_7m_cu100_devtoolset7_test
build_environment: "manywheel 3.7m cu100 devtoolset7"
requires:
- setup
- binary_linux_manywheel_3_7m_cu100_devtoolset7_build
docker_image: "pytorch/manylinux-cuda100"
use_cuda_docker_runtime: "1"
resource_class: gpu.medium
- binary_linux_test:
name: binary_linux_conda_2_7_cpu_devtoolset7_test
build_environment: "conda 2.7 cpu devtoolset7"
requires:
- setup
- binary_linux_conda_2_7_cpu_devtoolset7_build
docker_image: "pytorch/conda-cuda"
# This binary build is currently broken, see https://github.com/pytorch/pytorch/issues/16710
# - binary_linux_conda_3_6_cu90_devtoolset7_test:
- binary_linux_test:
name: binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_test
build_environment: "libtorch 2.7m cpu devtoolset7"
requires:
- setup
- binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_build
libtorch_variant: "shared-with-deps"
docker_image: "pytorch/manylinux-cuda100"
- binary_linux_test:
name: binary_linux_libtorch_2_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test
build_environment: "libtorch 2.7m cpu gcc5.4_cxx11-abi"
requires:
- setup
- binary_linux_libtorch_2_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build
libtorch_variant: "shared-with-deps"
docker_image: "pytorch/pytorch-binary-docker-image-ubuntu16.04:latest"


@@ -1,66 +0,0 @@
docker_build:
triggers:
- schedule:
cron: "0 15 * * 0"
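# i.e. 15:00 UTC every Sunday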
filters:
branches:
only:
- master
jobs:
- docker_build_job:
name: "pytorch-linux-bionic-clang9-thrift-llvmdev"
image_name: "pytorch-linux-bionic-clang9-thrift-llvmdev"
- docker_build_job:
name: "pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7"
- docker_build_job:
name: "pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7"
- docker_build_job:
name: "pytorch-linux-xenial-cuda8-cudnn7-py2"
image_name: "pytorch-linux-xenial-cuda8-cudnn7-py2"
- docker_build_job:
name: "pytorch-linux-xenial-cuda8-cudnn7-py3"
image_name: "pytorch-linux-xenial-cuda8-cudnn7-py3"
- docker_build_job:
name: "pytorch-linux-xenial-cuda9-cudnn7-py2"
image_name: "pytorch-linux-xenial-cuda9-cudnn7-py2"
- docker_build_job:
name: "pytorch-linux-xenial-cuda9-cudnn7-py3"
image_name: "pytorch-linux-xenial-cuda9-cudnn7-py3"
- docker_build_job:
name: "pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7"
- docker_build_job:
name: "pytorch-linux-xenial-py2.7.9"
image_name: "pytorch-linux-xenial-py2.7.9"
- docker_build_job:
name: "pytorch-linux-xenial-py2.7"
image_name: "pytorch-linux-xenial-py2.7"
- docker_build_job:
name: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
image_name: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
- docker_build_job:
name: "pytorch-linux-xenial-py3-clang5-asan"
image_name: "pytorch-linux-xenial-py3-clang5-asan"
- docker_build_job:
name: "pytorch-linux-xenial-py3.5"
image_name: "pytorch-linux-xenial-py3.5"
- docker_build_job:
name: "pytorch-linux-xenial-py3.6-clang7"
image_name: "pytorch-linux-xenial-py3.6-clang7"
- docker_build_job:
name: "pytorch-linux-xenial-py3.6-gcc4.8"
image_name: "pytorch-linux-xenial-py3.6-gcc4.8"
- docker_build_job:
name: "pytorch-linux-xenial-py3.6-gcc5.4"
image_name: "pytorch-linux-xenial-py3.6-gcc5.4"
- docker_build_job:
name: "pytorch-linux-xenial-py3.6-gcc7.2"
image_name: "pytorch-linux-xenial-py3.6-gcc7.2"
- docker_build_job:
name: "pytorch-linux-xenial-py3.6-gcc7"
image_name: "pytorch-linux-xenial-py3.6-gcc7"
- docker_build_job:
name: "pytorch-linux-xenial-pynightly"
image_name: "pytorch-linux-xenial-pynightly"


@@ -1,56 +0,0 @@
- pytorch_linux_build:
name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build
build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_32"
requires:
- setup
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405"
filters:
branches:
only: nightly
- pytorch_linux_build:
name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build
build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-x86_64"
requires:
- setup
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405"
filters:
branches:
only: nightly
- pytorch_linux_build:
name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build
build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v7a"
requires:
- setup
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405"
filters:
branches:
only: nightly
- pytorch_linux_build:
name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build
build_environment: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c-arm-v8a"
requires:
- setup
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405"
filters:
branches:
only: nightly
- pytorch_android_gradle_build:
name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_android_gradle_build
requires:
- nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build
- nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build
- nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v7a_build
- nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_arm_v8a_build
filters:
branches:
only: nightly
- pytorch_android_publish_snapshot:
name: nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_android_publish_snapshot
requires:
- nightly_pytorch_linux_xenial_py3_clang5_android_ndk_r19c_android_gradle_build
context: org-member
filters:
branches:
only: nightly


@@ -1,33 +0,0 @@
# Pytorch iOS binary builds
- binary_ios_build:
name: pytorch_ios_11_2_1_nightly_x86_64_build
build_environment: "libtorch-ios-11.2.1-nightly-x86_64-build"
context: org-member
ios_platform: "SIMULATOR"
ios_arch: "x86_64"
requires:
- setup
filters:
branches:
only: nightly
- binary_ios_build:
name: pytorch_ios_11_2_1_nightly_arm64_build
build_environment: "libtorch-ios-11.2.1-nightly-arm64-build"
context: org-member
ios_arch: "arm64"
ios_platform: "OS"
requires:
- setup
filters:
branches:
only: nightly
- binary_ios_upload:
build_environment: "libtorch-ios-11.2.1-nightly-binary-build-upload"
context: org-member
requires:
- setup
- pytorch_ios_11_2_1_nightly_x86_64_build
- pytorch_ios_11_2_1_nightly_arm64_build
filters:
branches:
only: nightly


@@ -1,11 +0,0 @@
#- binary_linux_libtorch_2.7m_cpu_test:
# requires:
# - binary_linux_libtorch_2.7m_cpu_build
#- binary_linux_libtorch_2.7m_cu90_test:
# requires:
# - binary_linux_libtorch_2.7m_cu90_build
#- binary_linux_libtorch_2.7m_cu100_test:
# requires:
# - binary_linux_libtorch_2.7m_cu100_build
# Nightly uploads

Some files were not shown because too many files have changed in this diff.